In this guide, we'll walk through a fully featured Dagster project that takes advantage of a wide range of Dagster features. This example can be useful as a point of reference for using different Dagster APIs and integrating other tools.
At a high level, this project consists of three asset groups, all centered around a contrived organization that wants to do ML and analysis on Hacker News user activity data.
To follow along with this guide, you can bootstrap your own project with this example:
dagster project from-example --name my-dagster-project --example project_fully_featured
To install this example and its Python dependencies, run:
cd my-dagster-project
pip install -e .
Once you've done this, you can run:
dagit
to view this example in Dagster's UI, Dagit.
This example shows useful patterns for many Dagster concepts, including:
Software-defined assets - An asset is a software object that models a data asset. The prototypical example is a table in a database or a file in cloud storage.
This example contains three asset groups:
core: Contains data sets of activity on Hacker News, fetched from the Hacker News API. These are partitioned by hour and updated every hour.
recommender: A machine learning model that recommends stories to specific users based on their comment history, as well as the features and training set used to fit that model. These are dropped and recreated whenever the core assets receive updates.
activity_analytics: Aggregate statistics computed about Hacker News activity, represented by dbt models and a Python model that depends on them. These are dropped and recreated whenever the core assets receive updates.
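To make "partitioned by hour" concrete, here is a minimal plain-Python sketch that enumerates one partition key per full hour in a time range. It does not use the Dagster API; the function name `hourly_partition_keys` and the key format are assumptions chosen for illustration, not taken from the project.

```python
from __future__ import annotations

from datetime import datetime, timedelta


def hourly_partition_keys(start: datetime, end: datetime) -> list[str]:
    """Return one key per full hour that fits inside [start, end)."""
    # Snap the start up to the next hour boundary if it is not already on one.
    current = start.replace(minute=0, second=0, microsecond=0)
    if current < start:
        current += timedelta(hours=1)

    keys = []
    # Only emit a key once the whole hour has elapsed before `end`.
    while current + timedelta(hours=1) <= end:
        keys.append(current.strftime("%Y-%m-%d-%H:%M"))
        current += timedelta(hours=1)
    return keys


keys = hourly_partition_keys(
    datetime(2022, 1, 1, 0, 0),
    datetime(2022, 1, 1, 3, 30),
)
print(keys)
```

The partial hour starting at 03:00 is skipped because it has not finished by the end of the range, which mirrors the idea that an hourly partition is only materialized once its data is complete.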
Resources - A resource is an object that models a connection to a (typically) external service. Resources can be shared between assets, and different implementations of resources can be used depending on the environment. In this example, we built multiple Hacker News API resources, all of which have the same interface but different implementations:
HNAPIClient interacts with the real Hacker News API and gets the full data set, which will be used in production.
HNAPISubsampleClient talks to the real API but subsamples the data, which is much faster than the normal implementation and is great for demoing purposes.
HNSnapshotClient reads from a local snapshot, which is useful for unit testing or environments where the connection isn't available.
The way we model resources helps separate the business logic in code from environments, e.g. you can easily switch resources without changing your pipeline code.
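The "same interface, different implementations" idea can be sketched in plain Python as follows. This is not the project's code: the `HNClient` protocol, the `fetch_item_ids` method, and the two stand-in classes are hypothetical names chosen only to show how business logic can depend on an interface rather than a concrete client.

```python
from __future__ import annotations

from typing import Protocol


class HNClient(Protocol):
    """Hypothetical shared interface; the method name is an assumption."""

    def fetch_item_ids(self) -> list[int]: ...


class SnapshotClient:
    """Stand-in for a snapshot-backed client: serves fixed, local data."""

    def __init__(self, snapshot: list[int]) -> None:
        self._snapshot = snapshot

    def fetch_item_ids(self) -> list[int]:
        return list(self._snapshot)


class SubsampleClient:
    """Stand-in for a subsampling client: wraps a client, keeps every n-th item."""

    def __init__(self, inner: HNClient, every_nth: int) -> None:
        self._inner = inner
        self._every_nth = every_nth

    def fetch_item_ids(self) -> list[int]:
        return self._inner.fetch_item_ids()[:: self._every_nth]


def count_items(client: HNClient) -> int:
    """Business logic that only relies on the shared interface."""
    return len(client.fetch_item_ids())


full = SnapshotClient(list(range(100)))
sampled = SubsampleClient(full, every_nth=10)
print(count_items(full), count_items(sampled))
```

Because `count_items` never names a concrete class, swapping the full client for the subsampled one requires no change to the logic itself.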
I/O managers - An I/O manager is a special kind of resource that handles storing and loading assets. This example includes a wide range of I/O managers such as:
DuckDBPartitionedParquetIOManager: Interacts with Spark and dbt without any long-running process. It minimizes setup difficulty and is useful for local development.
SnowflakeIOManager: Handles outputs that are either Spark or Pandas DataFrames and writes data to a Snowflake table specified by metadata on the relevant Out. The metadata is helpful for observability, especially in production.
Schedules - A schedule allows you to execute a job at a fixed interval. This example includes an hourly schedule that materializes the core asset group every hour.
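As a rough illustration of the I/O manager idea described above (one object responsible for both storing and loading assets), here is a plain-Python sketch. The two method names mirror the `handle_output` / `load_input` pair that Dagster I/O managers implement, but `AssetContext` and `JsonFileIOManager` are hypothetical, simplified stand-ins and do not use the Dagster API or reproduce the project's I/O managers.

```python
import json
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Any


@dataclass
class AssetContext:
    """Hypothetical, simplified context: which asset partition is being handled."""

    asset_name: str
    partition_key: str


class JsonFileIOManager:
    """Stores each asset partition as a JSON file under a base directory."""

    def __init__(self, base_dir: Path) -> None:
        self._base_dir = base_dir

    def _path(self, context: AssetContext) -> Path:
        # One directory per asset, one file per partition.
        return self._base_dir / context.asset_name / f"{context.partition_key}.json"

    def handle_output(self, context: AssetContext, obj: Any) -> None:
        path = self._path(context)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps(obj))

    def load_input(self, context: AssetContext) -> Any:
        return json.loads(self._path(context).read_text())


base_dir = Path(tempfile.mkdtemp())
io_manager = JsonFileIOManager(base_dir)
ctx = AssetContext(asset_name="comments", partition_key="2022-01-01-00")

io_manager.handle_output(ctx, [{"id": 1, "by": "alice"}])
loaded = io_manager.load_input(ctx)
print(loaded)
```

The code that produces or consumes the data never touches file paths; replacing this class with one that writes to a warehouse would leave that code unchanged.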
Sensors - A sensor allows you to instigate runs based on some external state change. In this example, we have sensors to react to different state changes:
Testing - All Dagster entities are unit-testable. This example illustrates lightweight invocations in unit tests, including:
@asset-decorated functions. Read more about testing assets on the Testing page.
I/O managers, using OutputContext and InputContext with mocks. Check out Testing an IO manager to learn more.
This example is meant to be loaded from three deployments: local, staging, and prod.
By default, it will load for the local deployment. You can toggle deployments by setting the DAGSTER_DEPLOYMENT env var to prod or staging.
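The toggle described above could be sketched like this. Only the `DAGSTER_DEPLOYMENT` variable name, its `prod` and `staging` values, and the `local` default come from this guide; the `RESOURCES_BY_DEPLOYMENT` mapping, its contents, and the helper functions are hypothetical placeholders, not the project's actual configuration.

```python
from __future__ import annotations

import os
from typing import Mapping

# Hypothetical per-deployment settings; the values are illustrative only.
RESOURCES_BY_DEPLOYMENT = {
    "local": {"hn_client": "snapshot", "storage": "local files"},
    "staging": {"hn_client": "subsample", "storage": "staging warehouse"},
    "prod": {"hn_client": "full", "storage": "production warehouse"},
}


def get_deployment_name(environ: Mapping[str, str] | None = None) -> str:
    """Read DAGSTER_DEPLOYMENT, falling back to 'local' when it is unset."""
    env = os.environ if environ is None else environ
    name = env.get("DAGSTER_DEPLOYMENT", "local")
    if name not in RESOURCES_BY_DEPLOYMENT:
        raise ValueError(f"Unknown deployment: {name!r}")
    return name


def resources_for(environ: Mapping[str, str] | None = None) -> dict[str, str]:
    """Pick the resource settings that match the current deployment."""
    return RESOURCES_BY_DEPLOYMENT[get_deployment_name(environ)]


print(get_deployment_name({}))
print(resources_for({"DAGSTER_DEPLOYMENT": "prod"}))
```

Accepting the environment as an optional argument keeps the lookup easy to unit test without mutating `os.environ`.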
Beyond leveraging Dagster core concepts, this project also uses several Dagster integration libraries:
dagster-dbt: The project includes a dbt project, dbt_project, and loads dbt models from an existing dbt manifest.json file in the dbt project as Dagster assets. This is useful for larger dbt projects, as you may not want to recompile the entire dbt project every time you load the Dagster project.
dagster-pyspark: The project includes a PartitionedParquetIOManager that can take a PySpark DataFrame and store it in Parquet at the given path. It uses pyspark_resource to access a PySpark SparkSession for executing PySpark code within Dagster.
As time goes on, this guide will be kept up to date, taking advantage of new Dagster features and learnings from the community. If you have anything you'd like to add, or an additional example you'd like to see, don't hesitate to reach out!