# Build your first ETL pipeline
In this tutorial, you'll build an ETL pipeline with Dagster that:
- Imports sales data to DuckDB
- Transforms data into reports
- Runs scheduled reports automatically
- Generates one-time reports on demand
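At its core, the transform step aggregates raw sales rows into report tables inside the warehouse. As a standard-library-only preview (using `sqlite3` in place of DuckDB, with hypothetical table and column names), the transform amounts to:

```python
import sqlite3

# In the tutorial the warehouse is DuckDB; sqlite3 stands in here so this
# sketch runs with only the standard library. Table and columns are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (rep TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("alice", 120.0), ("bob", 80.0), ("alice", 50.0)],
)

# The "transform" step: aggregate raw rows into a per-rep report
report = conn.execute(
    "SELECT rep, SUM(amount) FROM sales GROUP BY rep ORDER BY rep"
).fetchall()
print(report)  # → [('alice', 170.0), ('bob', 80.0)]
```

In the tutorial, Dagster assets wrap queries like this so they can be materialized, scheduled, and checked automatically.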
You will learn to:
- Set up a Dagster project with the recommended project structure
- Create and materialize assets
- Create and materialize dependent assets
- Ensure data quality with asset checks
- Create and materialize partitioned assets
- Automate the pipeline
- Create and materialize a sensor asset
- Refactor your project when it becomes more complex
## Prerequisites
To follow the steps in this guide, you'll need:
- Basic Python knowledge
- Python 3.9+ installed on your system (refer to the Installation guide for more information)
- Familiarity with SQL and Python data manipulation libraries, such as Pandas
- An understanding of data pipelines and the extract, transform, and load (ETL) process
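To confirm your interpreter satisfies the Python 3.9+ requirement before continuing, a quick check:

```python
import sys

# Prints True when the running interpreter meets the tutorial's 3.9+ requirement
print(sys.version_info >= (3, 9))
```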
## Step 1: Set up your Dagster environment
First, set up a new Dagster project.
1. Open your terminal and create a new directory for your project:

   ```bash
   mkdir dagster-etl-tutorial
   cd dagster-etl-tutorial
   ```

2. Create and activate a virtual environment:

   - MacOS:

     ```bash
     python -m venv dagster_tutorial
     source dagster_tutorial/bin/activate
     ```

   - Windows:

     ```bash
     python -m venv dagster_tutorial
     dagster_tutorial\Scripts\activate
     ```

3. Install Dagster and the required dependencies:

   ```bash
   pip install dagster dagster-webserver pandas dagster-duckdb
   ```
## Step 2: Create the Dagster project structure
Run the following command to create the project directories and files for this tutorial:

```bash
dagster project from-example --example getting_started_etl_tutorial
```
Your project should have this structure:
dagster-etl-tutorial/
├── data/
│   ├── products.csv
│   ├── sales_data.csv
│   ├── sales_reps.csv
│   └── sample_request/