A workspace is a collection of user-defined repositories and information about where to find them. Dagster tools, like Dagit and the Dagster CLI, use workspaces to load user code.
Name | Description |
---|---|
@repository | The decorator used to define repositories. The decorator returns a RepositoryDefinition |
A Dagster Workspace can contain multiple repositories sourced from a variety of different locations, such as Python modules or Python virtualenvs.
The repositories in a workspace are loaded in a different process and communicate with Dagster system processes over an RPC mechanism. This architecture provides several advantages:
The structure of a workspace is encoded in a yaml document. By convention is it named workspace.yaml
. The workspace yaml encodes where to find repositories.
Each entry in the workspace is referred to as a repository location. A repository location can include more than one repository. Each repository location is loaded in its own process that Dagster tools use an RPC protocol to communicate with. This process separation allows multiple repository locations in different environments to be loaded independently, and ensures that errors in user code can't impact Dagster system code.
If you want to load repositories from Python code, use the python_file
or python_package
keys in your workspace YAML.
If you use python_file
, it must specify a path relative to the workspace file leading to a file containing at least one repository definition. For example, the repository defined in repository.py
here:
from dagster import job, op, repository @op def hello_world(): pass @job def hello_world_job(): hello_world() @repository def hello_world_repository(): return [hello_world_job]
could be loaded using the following workspace.yaml
file in the same folder:
load_from: - python_file: repository.py
Both Dagit and the dagster-daemon process need to know how to load your workspace. By default, they will look for the workspace.yaml
file in the current directory, allowing you to launch from that directory with no arguments:
dagit
Using a workspace.yaml
file keeps you from having to type the same command-line flags repeatedly to load your code. You can also load the workspace yaml file from a different folder with -w
. See detailed references in the API docs.
You might have more than one repository in scope and want to specify a specific one. Our schema supports this with the attribute
key, which must be a repository name or the name of a function that returns a RepositoryDefinition
. For example:
load_from: - python_file: relative_path: hello_world_repository.py attribute: hello_world_repository
The example above also illustrates that the python_file
key can be a single string if the only configuration needed is the relative file path, but must be a map if more parameters are added.
python_package
can also be used instead of python_file
to load code from a local or installed Python package.
load_from: - python_package: hello_world_package
A deprecated python_module
key with identical semantics to python_package
also exists, but you should prefer using either python_package
or python_file
.
You can also specify an attribute to identify a single repository within a package, as with python_file
:
load_from: - python_package: package_name: yourproject.hello_world_repository attribute: hello_world_repository
By default, your code will load with dagit
's working directory as the base path to resolve any local imports in your code. You can supply a custom working directory for relative imports using the working_directory
key. For example:
load_from: - python_file: relative_path: hello_world_repository.py working_directory: my_working_directory/ - python_package: package_name: my_team_package working_directory: my_other_working_directory/
By default, Dagit and other Dagster tools assume that repository locations should be loaded using the same Python environment that was used to load Dagster. However, it is often useful for repository locations to use independent environments. For example, a data engineering team running Spark can have dramatically different dependencies than an ML team running Tensorflow.
To enable this use case, Dagster supports customizing the Python environment for each repository location, by adding the executable_path
key to the YAML for a location. These environments can involve distinct sets of installed dependencies, or even completely different Python versions.
load_from: - python_package: package_name: dataengineering_spark_repository location_name: dataengineering_spark_team_py_38_virtual_env executable_path: venvs/path/to/dataengineering_spark_team/bin/python - python_file: relative_path: path/to/team_repos.py location_name: ml_team_py_36_virtual_env executable_path: venvs/path/to/ml_tensorflow/bin/python
The example above also illustrates the location_name
key. Each repository location in a workspace has a unique name that is displayed in Dagit, and is also used to disambiguate definitions with the same name across multiple repository locations. Dagster will supply a default name for each location based on its workspace entry if a custom one is not supplied.
When you start a schedule or a sensor, a serialized representation of the entry in your workspace.yaml
file is stored in a database, and the dagster-daemon
process uses this serialized representation to identify and load your schedule or sensor. This means that if you rename a repository location or change its configuration in your workspace.yaml
, you must also stop and restart any running schedules or sensors in that repository location. You can do this within Dagit from the Status page under Schedules or Sensors.
If you run dagit
in the same folder as a workspace.yaml
file, it will load all the repositories in each repository location defined by the workspace. (You can also specify a different workspace file using the -w
command-line argument).
It's possible that one or more of your repository locations can't be loaded - for example, there might be a syntax error or some other unrecoverable error in one of your definitions. When this occurs, a warning message will appear in Dagit in the left-hand panel, directing you to a status page with an error message and stack trace for any locations that were unable to load.
By default, Dagster tools will automatically create a process on your local machine for each of your repository locations. However, it is also possible to run your own gRPC server that's responsible for serving information about your repositories. This can be useful in more complex system architectures that deploy user code separately from Dagit.
The Dagster gRPC server needs to have access to your code. To initialize the server, run the dagster api grpc
command and pass it a target file or module to load.
Similar to when specifying code in your workspace.yaml
, the target can be either a python file or python package, and you can use an 'attribute' flag to identify a single repository within that target. The server will automatically find and load the specified repositories.
You also need to specify a host and either a port or socket to run the server on.
# Load gRPC Server running on a port using a python file: dagster api grpc --python-file /path/to/file.py --host 0.0.0.0 --port 4266 # Load gRPC Server running on a socket using a python file: dagster api grpc --python-file /path/to/file.py --host 0.0.0.0 --socket /path/to/socket # Load gRPC Server using a python package: dagster api grpc --package-name my_package_name --host 0.0.0.0 --port 4266 # Specify an attribute within the target to load a specific repository: dagster api grpc --python-file /path/to/file.py --attribute my_repository --host 0.0.0.0 --port 4266 # Specify a working directory to use as the base folder for local imports: dagster api grpc --python-file /path/to/file.py --working-directory /var/my_working_dir --host 0.0.0.0 --port 4266
See the API docs for the full list of options that can be set when running a new gRPC server.
Then, in your workspace.yaml
, you can configure a new gRPC server repository location to load:
load_from: - grpc_server: host: localhost port: 4266 location_name: "my_grpc_server"
When running your own gRPC server within a container, you can tell Dagit that any runs launched from this repository location should be launched in a container with that same image. To do this, set the DAGSTER_CURRENT_IMAGE
environment variable to the name of the image before starting the server. After setting this environment variable for your server, when you view the Status page in dagit, you should see the image listed alongside your repository location.
This image will only be used by run launchers and executors that expect to use Docker images (like the DockerRunLauncher
, K8sRunLauncher
, docker_executor
, or k8s_job_executor
).
If you're using our built-in Helm chart, this environment variable is automatically set on each of your gRPC servers.