The DagsterInstance defines all of the configuration that Dagster needs for a single deployment - for example, where to store the history of past runs and their associated logs, where to stream the raw logs from op compute functions, and how to launch new runs.
All of the processes and services that make up your Dagster deployment should share a single instance config file so that they can effectively share information.
Some important configuration, like execution parallelism, is set on a per-job basis rather than on the instance.
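For example, one way to cap op-level parallelism for a single job is to configure that job's executor in code rather than in dagster.yaml. The snippet below is a minimal sketch - the op and job names are hypothetical - showing the multiprocess executor configured with a max_concurrent limit for one job:

from dagster import job, multiprocess_executor, op

@op
def do_work():
    return 1

# max_concurrent caps how many ops of this job execute at once;
# this setting lives on the job, not in dagster.yaml.
@job(executor_def=multiprocess_executor.configured({"max_concurrent": 4}))
def my_parallel_job():
    do_work()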
When you launch a Dagster process, like Dagit or the Dagster CLI commands, Dagster attempts to load your instance. If the environment variable DAGSTER_HOME is set, Dagster will look for an instance config file at $DAGSTER_HOME/dagster.yaml. This file contains each of the configuration settings that make up the instance.
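The same loading behavior is available from Python. The following is a minimal sketch (not part of the instance config itself) that assumes DAGSTER_HOME points at a directory containing dagster.yaml:

from dagster import DagsterInstance

# Loads the instance described by $DAGSTER_HOME/dagster.yaml.
# This raises an error if DAGSTER_HOME is not set.
instance = DagsterInstance.get()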
By default (if dagster.yaml is not present or nothing is specified in that file), Dagster will store this information on the local filesystem, laid out like this:
$DAGSTER_HOME
├── dagster.yaml
├── history
│   ├── runs
│   │   ├── 00636713-98a9-461c-a9ac-d049407059cd.db
│   │   └── ...
│   └── runs.db
└── storage
    ├── 00636713-98a9-461c-a9ac-d049407059cd
    │   └── compute_logs
    │       ├── my_solid.compute.complete
    │       ├── my_solid.compute.err
    │       ├── my_solid.compute.out
    │       └── ...
    └── ...
The runs.db and {run_id}.db files are SQLite database files recording information about runs and per-run event logs, respectively. The compute_logs directories (one per run) contain the stdout and stderr logs from the execution of the compute functions of each op.
If DAGSTER_HOME is not set, the Dagster tools will use an ephemeral instance for execution. In this case, the run and event log storages will be in-memory rather than persisted to disk, and filesystem storage will use a temporary directory that is cleaned up when the process exits. This is useful for tests and is the default when invoking Python APIs such as JobDefinition.execute_in_process directly.
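As a rough illustration of this ephemeral behavior, the sketch below (using a hypothetical op and job) runs a job in process. With no DAGSTER_HOME set, execute_in_process falls back to an ephemeral, in-memory instance; passing an instance explicitly records run and event log history instead:

from dagster import DagsterInstance, job, op

@op
def say_hello():
    return "hello"

@job
def hello_job():
    say_hello()

# With DAGSTER_HOME unset, this uses an ephemeral, in-memory instance.
result = hello_job.execute_in_process()

# Passing an instance explicitly records the run in that instance's storage
# (assumes DAGSTER_HOME is set so DagsterInstance.get() can load dagster.yaml).
result = hello_job.execute_in_process(instance=DagsterInstance.get())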
In persistent Dagster deployments, you will typically want to configure many of the components on the instance. For example, you may want to use a Postgres instance to store runs and the corresponding event logs, and to stream compute logs to an S3 bucket.
To do this, provide a $DAGSTER_HOME/dagster.yaml file. Dagit and all Dagster tools will look for this file on startup. In the dagster.yaml file, you can configure many different aspects of your Dagster instance, all of which are detailed below.
Note that Dagster supports retrieving instance YAML values from environment variables, using an env: key instead of a string literal value. Examples of using env: are included in the sample configurations below.
Dagster storage configures how job and asset history is persisted - this includes metadata on runs, event logs, schedule/sensor ticks, and other useful data.
To configure storage, set the storage key in your dagster.yaml. There are three available options:
DagsterSqliteStorage uses a SQLite DB for storage.

# there are two ways to set storage to SqliteStorage

# this config manually sets the directory (`base_dir`) for SQLite to store data in:
storage:
  sqlite:
    base_dir: /path/to/dir

# and this config grabs the directory from an environment variable
storage:
  sqlite:
    base_dir:
      env: SQLITE_STORAGE_BASE_DIR
DagsterPostgresStorage uses a Postgres DB as the backing storage solution. This requires that the `dagster-postgres` library be installed.

# Postgres storage can be set using either credentials or a connection string. This requires that
# the `dagster-postgres` library be installed.

# this config manually sets the Postgres credentials
storage:
  postgres:
    postgres_db:
      username: { DAGSTER_PG_USERNAME }
      password: { DAGSTER_PG_PASSWORD }
      hostname: { DAGSTER_PG_HOSTNAME }
      db_name: { DAGSTER_PG_DB }
      port: 5432

# and this config grabs the database credentials from environment variables
storage:
  postgres:
    postgres_db:
      username:
        env: DAGSTER_PG_USERNAME
      password:
        env: DAGSTER_PG_PASSWORD
      hostname:
        env: DAGSTER_PG_HOST
      db_name:
        env: DAGSTER_PG_DB
      port: 5432

# and this config sets the credentials via DB connection string / url:
storage:
  postgres:
    postgres_url: { PG_DB_CONN_STRING }

# This config gets the DB connection string / url via environment variables:
storage:
  postgres:
    postgres_url:
      env: PG_DB_CONN_STRING
DagsterMySQLStorage uses a MySQL DB as the backing storage solution. This requires that the `dagster-mysql` library be installed.

# MySQL storage can be set using either credentials or a connection string. This requires that the
# `dagster-mysql` library be installed.

# this config manually sets the MySQL credentials
storage:
  mysql:
    mysql_db:
      username: { DAGSTER_MYSQL_USERNAME }
      password: { DAGSTER_MYSQL_PASSWORD }
      hostname: { DAGSTER_MYSQL_HOSTNAME }
      db_name: { DAGSTER_MYSQL_DB }
      port: 3306

# and this config grabs the database credentials from environment variables
storage:
  mysql:
    mysql_db:
      username:
        env: DAGSTER_MYSQL_USERNAME
      password:
        env: DAGSTER_MYSQL_PASSWORD
      hostname:
        env: DAGSTER_MYSQL_HOSTNAME
      db_name:
        env: DAGSTER_MYSQL_DB
      port: 3306

# and this config sets the credentials via DB connection string / url:
storage:
  mysql:
    mysql_url: { MYSQL_DB_CONN_STRING }

# this config grabs the MySQL connection string from environment variables
storage:
  mysql:
    mysql_url:
      env: MYSQL_DB_CONN_STRING
The run launcher determines where runs are executed.
There are several Dagster-provided options for the Run Launcher; users can also write custom run launchers. See the Run Launcher docs for more information.
To configure the Run Launcher, set run_launcher in your dagster.yaml in one of the following ways:
The DefaultRunLauncher spawns a new process on the same node as a job's repository location. Please see the Run Launcher docs for deployment information.

run_launcher:
  module: dagster.core.launcher
  class: DefaultRunLauncher
The DockerRunLauncher allocates a Docker container per run. Please see the Run Launcher docs for deployment information.

run_launcher:
  module: dagster_docker
  class: DockerRunLauncher
The K8sRunLauncher allocates a Kubernetes Job per run. Please see the Run Launcher docs for deployment information.

# there are multiple ways to configure the K8sRunLauncher

# you can set the following configuration values directly
run_launcher:
  module: dagster_k8s.launcher
  class: K8sRunLauncher
  config:
    service_account_name: pipeline_run_service_account
    job_image: my_project/dagster_image:latest
    instance_config_map: dagster-instance
    postgres_password_secret: dagster-postgresql-secret

# alternatively, you can grab any of these config values from environment variables:
run_launcher:
  module: dagster_k8s.launcher
  class: K8sRunLauncher
  config:
    service_account_name:
      env: PIPELINE_RUN_SERVICE_ACCOUNT
    job_image:
      env: DAGSTER_IMAGE_NAME
    instance_config_map:
      env: DAGSTER_INSTANCE_CONFIG_MAP
    postgres_password_secret:
      env: DAGSTER_POSTGRES_SECRET
The run coordinator determines the policy used to set prioritization rules and concurrency limits for runs. Please see the Run Coordinator docs for more information and for troubleshooting help.
To configure the Run Coordinator, set the run_coordinator key in your dagster.yaml. There are two options:
The DefaultRunCoordinator immediately sends runs to the run launcher (no notion of Queued runs).
See the Run Coordinator docs for more information.
# Since DefaultRunCoordinator is the default option, omitting the `run_coordinator` key will also suffice,
# but if you would like to set it explicitly:
run_coordinator:
  module: dagster.core.run_coordinator
  class: DefaultRunCoordinator
The QueuedRunCoordinator allows you to set limits on the number of runs that can be in progress at once. Note that this requires an active dagster-daemon process to actually launch the runs.

This run coordinator supports both an overall limit on the number of concurrent runs and more specific limits based on run tags - for example, you can limit the number of concurrent runs that interact with a particular cloud service, to avoid being throttled.
For more information, see the Run Coordinator docs.
# There are a few ways to configure the QueuedRunCoordinator:

# this first option has concurrency limits set to default values
run_coordinator:
  module: dagster.core.run_coordinator
  class: QueuedRunCoordinator

# this second option manually specifies limits:
run_coordinator:
  module: dagster.core.run_coordinator
  class: QueuedRunCoordinator
  config:
    max_concurrent_runs: 25
    tag_concurrency_limits:
      - key: "database"
        value: "redshift"
        limit: 4
      - key: "dagster/backfill"
        limit: 10

# as always, some or all of these values can be obtained from environment variables:
run_coordinator:
  module: dagster.core.run_coordinator
  class: QueuedRunCoordinator
  config:
    max_concurrent_runs:
      env: DAGSTER_OVERALL_CONCURRENCY_LIMIT
    tag_concurrency_limits:
      - key: "database"
        value: "redshift"
        limit:
          env: DAGSTER_REDSHIFT_CONCURRENCY_LIMIT
      - key: "dagster/backfill"
        limit:
          env: DAGSTER_BACKFILL_CONCURRENCY_LIMIT
Compute log storage controls the capture and persistence of raw stdout & stderr text logs.
To configure Compute Log Storage, set the compute_logs key in your dagster.yaml.
LocalComputeLogManager writes stdout & stderr logs to disk.

# there are two ways to set the directory that the LocalComputeLogManager writes
# stdout & stderr logs to

# You could directly set the `base_dir` key
compute_logs:
  module: dagster.core.storage.local_compute_log_manager
  class: LocalComputeLogManager
  config:
    base_dir: /path/to/directory

# Alternatively, you could set the `base_dir` key to an environment variable
compute_logs:
  module: dagster.core.storage.local_compute_log_manager
  class: LocalComputeLogManager
  config:
    base_dir:
      env: LOCAL_COMPUTE_LOG_MANAGER_DIRECTORY
AzureBlobComputeLogManager writes stdout & stderr logs to Azure Blob Storage.

# there are multiple ways to configure the AzureBlobComputeLogManager

# you can set the necessary configuration values directly:
compute_logs:
  module: dagster_azure.blob.compute_log_manager
  class: AzureBlobComputeLogManager
  config:
    storage_account: mycorp-dagster
    container: compute-logs
    secret_key: foo
    local_dir: /tmp/bar
    prefix: dagster-test-

# alternatively, you can obtain any of these config values from environment variables
compute_logs:
  module: dagster_azure.blob.compute_log_manager
  class: AzureBlobComputeLogManager
  config:
    storage_account:
      env: MYCORP_DAGSTER_STORAGE_ACCOUNT_NAME
    container:
      env: CONTAINER_NAME
    secret_key:
      env: SECRET_KEY
    local_dir:
      env: LOCAL_DIR_PATH
    prefix:
      env: DAGSTER_COMPUTE_LOG_PREFIX
S3ComputeLogManager writes stdout & stderr logs to AWS S3.

# there are multiple ways to configure the S3ComputeLogManager

# you can set the config values directly:
compute_logs:
  module: dagster_aws.s3.compute_log_manager
  class: S3ComputeLogManager
  config:
    bucket: "mycorp-dagster-compute-logs"
    prefix: "dagster-test-"

# or grab some or all of them from environment variables
compute_logs:
  module: dagster_aws.s3.compute_log_manager
  class: S3ComputeLogManager
  config:
    bucket:
      env: MYCORP_DAGSTER_COMPUTE_LOGS_BUCKET
    prefix:
      env: DAGSTER_COMPUTE_LOG_PREFIX
Local artifact storage is used to configure storage for any artifacts that require a local disk, or when using the filesystem IO manager to store inputs and outputs. See IO Managers for more information on how other IO managers store artifacts.
To configure Local Artifact Storage, set local_artifact_storage as follows in your dagster.yaml:
LocalArtifactStorage is currently the only option for Local Artifact Storage. This configures the directory used by the default filesystem IO Manager, as well as any artifacts that require a local disk.

# there are two possible ways to configure LocalArtifactStorage

# this example local_artifact_storage setup points `base_dir` at a directory on local disk
local_artifact_storage:
  module: dagster.core.storage.root
  class: LocalArtifactStorage
  config:
    base_dir: "/path/to/dir"

# alternatively, `base_dir` can be set to an environment variable
local_artifact_storage:
  module: dagster.core.storage.root
  class: LocalArtifactStorage
  config:
    base_dir:
      env: DAGSTER_LOCAL_ARTIFACT_STORAGE_DIR
To configure telemetry, set the telemetry key in your dagster.yaml. This allows you to opt in or out (set to true by default) of Dagster collecting anonymized usage statistics.
For more information on how and why we use telemetry, please visit the Telemetry Docs.
# Allows opting out of Dagster collecting usage statistics.
telemetry:
  enabled: false
The code_servers key lets you configure how Dagster loads the code in your workspace.

When you aren't running your own gRPC server, Dagit and the Dagster Daemon load your code from a gRPC server running in a subprocess. By default, if your code takes more than 60 seconds to load, Dagster will assume that it is hanging and stop waiting for it to load. If you expect that your repository code will take longer than 60 seconds to load, you can set the local_startup_timeout key:
# Configures how long Dagster waits for repositories
# to load before timing out.
code_servers:
  local_startup_timeout: 120
The retention key lets you configure how long Dagster retains certain types of data that have diminishing value over time, such as schedule/sensor tick data. If you want to clean up old ticks to minimize storage concerns and improve query performance, you can set a retention policy using the retention config key.

For schedule and sensor ticks, you can specify the field purge_after_days, which takes either a mapping of tick types to integers or a single integer that applies to all tick types. This integer specifies the number of days after which ticks can be safely removed. A value of -1 indicates that ticks should be retained indefinitely. For example:
# Configures how long Dagster keeps sensor / schedule tick data
retention:
  schedule:
    purge_after_days: 90 # sets retention policy for schedule ticks of all types
  sensor:
    purge_after_days:
      skipped: 7
      failure: 30
      success: -1 # keep success ticks indefinitely
By default, Dagster retains skipped sensor ticks for 7 days and retains all other ticks indefinitely.
The sensors key lets you configure how your sensors are evaluated. If you want your sensors to be evaluated asynchronously, set the use_threads attribute along with a num_workers config setting:
sensors:
  use_threads: true
  num_workers: 8
By default, Dagster evaluates sensors synchronously.