# Data Science & Prototyping Workflow
A key feature of the Airnub Prefect Starter template is its integrated structure that supports both data science exploration/prototyping and the operationalization of that work into robust Prefect data pipelines. This document outlines the intended workflow.
## The Dual Structure
The template provides distinct areas for these two phases:
- **Data Science & Prototyping (`airnub_prefect_starter/data_science/` & `notebooks/`):**
    - `airnub_prefect_starter/data_science/`: This Python sub-package contains scripts inspired by common data science project structures (e.g., Cookiecutter Data Science).
        - `config_ds.py`: Defines file system paths for the various data stages (raw, processed, interim, models, etc.), typically within the project's root `data/` directory.
        - `dataset_processing.py`: Placeholder for scripts that perform initial data cleaning, transformation, and preparation (e.g., creating primary datasets from raw data).
        - `feature_engineering.py`: Placeholder for scripts that generate features for machine learning models.
        - `modeling/`: Contains scripts for model training (`train.py`) and prediction/inference (`predict.py`).
        - `visualization_scripts.py` (or `plots.py`): Placeholder for scripts that generate plots and visualizations.
    - `notebooks/`: The top-level directory for Jupyter notebooks. This is ideal for iterative exploration, ad-hoc analysis, and experimenting with different approaches before formalizing them into scripts.
- **Prefect Operationalization (`flows/` & `airnub_prefect_starter/core/`):**
    - `airnub_prefect_starter/core/`: This Python sub-package is where you place reusable, Prefect-agnostic core logic functions that encapsulate the business rules and data transformations derived from your data science work.
    - `flows/`: This top-level directory contains all your Prefect flow definitions and their department/category-specific task wrappers. These task wrappers are typically thin calls to the functions in `airnub_prefect_starter/core/`. A sketch of how these areas fit together follows.
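A simplified sketch of how these areas sit side by side in the repository (only directories and files mentioned in this document are shown; the root folder name is illustrative):

```text
airnub-prefect-starter/
├── airnub_prefect_starter/
│   ├── core/                    # reusable, Prefect-agnostic logic
│   └── data_science/            # prototyping scripts
│       ├── config_ds.py
│       ├── dataset_processing.py
│       ├── feature_engineering.py
│       ├── visualization_scripts.py
│       └── modeling/
│           ├── train.py
│           └── predict.py
├── flows/                       # Prefect flows and task wrappers
├── notebooks/                   # Jupyter notebooks for exploration
├── configs/variables/           # YAML configs (become Prefect Variables)
├── data/                        # raw / interim / processed data
└── models/                      # trained model artifacts
```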
## The Intended Workflow: From Research to Production
- **Exploration & Prototyping (Data Science Phase):**
    - Use Jupyter notebooks in the `notebooks/` directory for initial data exploration, visualization, and trying out different algorithms or processing steps.
    - Formalize repeatable data processing, feature engineering, or model training steps into Python scripts within `airnub_prefect_starter/data_science/` (e.g., `dataset_processing.py`, `feature_engineering.py`, `modeling/train.py`).
    - These scripts use `airnub_prefect_starter/data_science/config_ds.py` to manage paths to raw, interim, and processed data and to models, typically stored within the project's `data/` and `models/` directories.
    - Run these scripts standalone (e.g., `python -m airnub_prefect_starter.data_science.modeling.train`) for development and iteration; see the sketch below.
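    For illustration, a standalone prototyping script might look like the following sketch. The path constants, file name, and cleaning steps are hypothetical; only the `config_ds.py` import location follows the structure described above.

    ```python
    # Hypothetical sketch of airnub_prefect_starter/data_science/dataset_processing.py.
    # Assumes config_ds.py defines RAW_DATA_DIR and PROCESSED_DATA_DIR as pathlib.Path objects.
    import pandas as pd

    from airnub_prefect_starter.data_science.config_ds import (
        PROCESSED_DATA_DIR,
        RAW_DATA_DIR,
    )


    def build_primary_dataset(filename: str = "events.csv") -> pd.DataFrame:
        """Clean a raw extract and persist it as a primary dataset."""
        df = pd.read_csv(RAW_DATA_DIR / filename)
        df = df.dropna(subset=["id"]).drop_duplicates(subset="id")
        out_path = PROCESSED_DATA_DIR / filename
        out_path.parent.mkdir(parents=True, exist_ok=True)
        df.to_csv(out_path, index=False)
        return df


    if __name__ == "__main__":
        build_primary_dataset()
    ```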
- **Refactoring for Operationalization (Core Logic):**
    - Once a piece of data processing logic, feature engineering step, or model inference routine is stable and well understood from the prototyping phase, refactor its core functionality into reusable Python functions.
    - Place these functions in appropriate modules within `airnub_prefect_starter/core/` (e.g., data cleaning functions in `core/data_cleaning.py`, model inference logic in `core/model_inference.py`).
    - These core functions should be well tested and independent of Prefect decorators, as in the sketch below.
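    Extracted from the prototype, a core function might be as small as this (module and function names are hypothetical):

    ```python
    # Hypothetical sketch of airnub_prefect_starter/core/data_cleaning.py.
    # Plain Python with no Prefect imports, so it is trivial to unit test.
    import pandas as pd


    def clean_events(df: pd.DataFrame) -> pd.DataFrame:
        """Drop incomplete/duplicate rows and normalize column names."""
        df = df.dropna(subset=["id"]).drop_duplicates(subset="id")
        df.columns = [c.strip().lower() for c in df.columns]
        return df
    ```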
- **Creating Prefect Tasks (Wrappers):**
    - Use the `scripts/generators/add_task.py` script to create new department/category-specific Prefect task wrappers within the relevant `flows/dept_.../.../tasks/` directory.
    - These task wrappers import and call the core logic functions you created in `airnub_prefect_starter/core/`.
    - The task wrapper handles Prefect-specific concerns like logging (`get_run_logger()`), retries, and parameterization from flow configurations, as shown below.
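    A thin wrapper around the core function above might look like this sketch (the module path and names are hypothetical; `@task`, `get_run_logger()`, and the retry parameters are standard Prefect APIs):

    ```python
    # Hypothetical sketch of a task wrapper in flows/dept_.../.../tasks/.
    import pandas as pd
    from prefect import get_run_logger, task

    from airnub_prefect_starter.core.data_cleaning import clean_events  # hypothetical module


    @task(retries=2, retry_delay_seconds=30)
    def clean_events_task(df: pd.DataFrame) -> pd.DataFrame:
        logger = get_run_logger()
        logger.info("Cleaning %d raw rows", len(df))
        return clean_events(df)
    ```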
- **Building Prefect Flows (Orchestration):**
    - Use `scripts/generators/add_category.py` and `scripts/generators/add_department.py` to scaffold your Prefect flows.
    - Implement your category and parent stage flows in the `flows/` directory to orchestrate these tasks, as sketched below.
    - Configure your flows using YAML files in `configs/variables/` (which become Prefect Variables) and Prefect Blocks for secrets/infrastructure.
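    A category flow orchestrating the task wrapper above might look roughly like this (the flow name, import path, and CSV step are hypothetical):

    ```python
    # Hypothetical sketch of a category flow in flows/.
    import pandas as pd
    from prefect import flow

    from flows.dept_example.category_example.tasks.cleaning import (  # hypothetical path
        clean_events_task,
    )


    @flow(name="example-category-flow")
    def example_category_flow(filename: str = "events.csv") -> None:
        raw = pd.read_csv(filename)  # stand-in for a real extraction task
        clean_events_task(raw)
    ```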
- **Deployment & Scheduling:**
    - Define deployments for your parent stage flows in `prefect.local.yaml` (for local execution) or other deployment manifests; an illustrative entry follows.
    - Run and schedule your flows using the Prefect UI or CLI.
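    An entry in such a manifest might look roughly like this (the deployment name, entrypoint, work pool, and schedule are hypothetical, and the exact schema depends on your Prefect version):

    ```yaml
    deployments:
      - name: example-category-flow-local   # hypothetical deployment name
        entrypoint: flows/dept_example/category_flow.py:example_category_flow
        work_pool:
          name: default                     # assumes a work pool named "default"
        schedule:
          cron: "0 6 * * *"                 # e.g., run daily at 06:00
    ```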
## Benefits of this Approach
- **Separation of Concerns:** Keeps exploratory/prototyping code distinct from production workflow code.
- **Rapid Prototyping:** Data scientists can work quickly in notebooks and standalone scripts without needing to immediately understand all of Prefect's intricacies.
- **Robust Operationalization:** Mature logic is refactored into testable core functions and then reliably orchestrated by Prefect.
- **Clear Path to Production:** Provides a structured way to move from an idea prototyped in a data science environment to a scheduled, monitored Prefect data pipeline.
This integrated workflow leverages the strengths of both traditional data science project structures and modern workflow orchestration with Prefect.