🔧 Configure Your Workflow
Configuring your workflow means setting global options and defining one or more subworkflows composed of modules. Global configuration sets output locations, computational resources, and other settings shared across all modules, while module settings control the behaviour of a module for a given task.
Note
The recommended way to manage your workflow configuration files is to save them outside of the toolbox directory in a directory dedicated to your project. That way you can guarantee the separation of the toolbox and your own configuration.
You can find example configuration files under configs/.
TL;DR Complete configuration for quickstart
# =============================================================================
# Global settings
# =============================================================================
# Note: relative paths must be relative to the project root, not the directory of the config file.
output_dir: data/out # large/intermediate files
images: images # plots and smaller outputs
use_gpu: false
env_mode: local

# =============================================================================
# Tasks: define one or more named workflows, each with input and module config
# =============================================================================
DATASETS:
  my_task: # ---------- custom task name (rename as needed) ----------
    # -------------------------------------------------------------------------
    # Input: map each module to its input files or the output of a prior module
    # -------------------------------------------------------------------------
    input:
      preprocessing:
        file_1: data/pbmc68k.h5ad # path to input file
      integration: preprocessing # use all outputs of preprocessing as input
      metrics: integration # use all outputs of integration as input
    # -------------------------------------------------------------------------
    # Module: preprocessing
    # -------------------------------------------------------------------------
    preprocessing:
      highly_variable_genes:
        n_top_genes: 2000
      pca:
        n_comps: 50
      assemble: # ordered list of steps to run
        - normalize
        - highly_variable_genes
        - pca
    # -------------------------------------------------------------------------
    # Module: integration
    # -------------------------------------------------------------------------
    integration:
      raw_counts: layers/counts # key for raw count matrix in the data object
      norm_counts: layers/normcounts # key for normalized count matrix
      batch: batch # column in obs used as batch key
      methods: # list integration methods to run (omit value to use defaults)
        unintegrated:
        bbknn:
        combat:
        harmony_pytorch:
        harmonypy:
        scanorama:
          batch_size: 100
        scvi:
          max_epochs: 10
          early_stopping: true
        drvi:
          max_epochs: 10
          early_stopping: true
        sysvi:
          max_epochs: 10
          early_stopping: true
          system_key: phase
    # -------------------------------------------------------------------------
    # Module: metrics
    # -------------------------------------------------------------------------
    metrics:
      unintegrated: layers/normcounts # baseline embedding to compare against
      batch: batch # column in obs used as batch key
      label: bulk_labels # column in obs used as cell type label
      metrics: # list of metrics to compute
        - asw_batch
        - graph_connectivity
        - ilisi
        - pcr_comparison
        - kbet_pg
        - ari
        - asw_label
        - cell_cycle
        - clisi
        - isolated_label_asw
        - isolated_label_f1
        - nmi
        - morans_i_random
        - pcr_random
        - pcr
        - pcr_genes
1. Global configuration
These settings define your project’s output structure and hardware utilization. The directories will be created automatically if they do not already exist.
Output & Computational Settings
Output Paths: Intermediate and large files are stored under output_dir, while images and summary files are stored under images.
Hardware Acceleration: If you have a supported GPU, set use_gpu to true. In the backend, this ensures Snakemake utilizes GPU-enabled conda environments for supported rules.
Note
Relative paths must be relative to the project root (where you call the pipeline), not the directory of the configuration file itself.
# Output locations
output_dir: data/out
images: images
# Hardware settings
use_gpu: true
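The note above on relative paths can be illustrated with a minimal Python sketch. It contrasts resolving `data/out` against the project root with resolving it against the config file's directory. The helper `resolve_from_project_root`, the project location, and the config file name are hypothetical and not part of the toolbox; the sketch only assumes that the pipeline is called from the project root.

```python
from pathlib import PurePosixPath


def resolve_from_project_root(project_root: str, configured_path: str) -> PurePosixPath:
    """Join a configured path onto the project root (hypothetical helper).

    Absolute paths are returned unchanged, because joining an absolute
    path onto another path discards the left-hand side.
    """
    return PurePosixPath(project_root) / configured_path


# Hypothetical layout: the config file lives in a subdirectory of the project.
project_root = "/home/user/my_project"
config_file = PurePosixPath(project_root) / "configs" / "my_config.yaml"

# What the note describes: relative to where the pipeline is called.
expected = resolve_from_project_root(project_root, "data/out")

# What one might wrongly assume: relative to the config file's directory.
mistaken = config_file.parent / "data/out"

print(expected)  # /home/user/my_project/data/out
print(mistaken)  # /home/user/my_project/configs/data/out
```

If the two results differ for your setup, the first one is the location the note refers to.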
2. Input configuration
You can select and combine modules to create a custom workflow by specifying the input and module configuration in a YAML file. Each workflow instance needs a user-defined task name and can combine any number of modules in its input mapping.
Under each task, the input section lists module names, and each module is mapped to either its input files or the output of a previous module.
DATASETS:
  my_task:
    input:
      preprocessing:
        file_1: data/pbmc68k.h5ad
        # file_2: ... # more files if required
      integration: preprocessing # all outputs of module will automatically be used as input
      metrics: integration
  another_dataset:
    ...
See the documentation for more details on how to specify input mappings, the different formats you can use, and how they are resolved to file ids for downstream modules.
Warning
There can only be one instance of a module as a key in the input mapping (in the backend this is a dictionary). But you can reuse the same module output as input for multiple other modules. The order of the entries in the input mapping doesn’t matter.
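To make the warning concrete, here is a small Python sketch of why the entry order is irrelevant and why one module's output can feed several others. It treats the parsed input mapping as a dependency graph and sorts it topologically with the standard library. This is an illustration only, not the toolbox's actual resolution logic; the function `upstream_modules` is hypothetical, and the sketch assumes only the two value forms shown above (a file mapping, or the name of an upstream module).

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Input mapping as it would look after parsing the YAML, with the entries
# deliberately listed out of order.
input_mapping = {
    "metrics": "integration",
    "integration": "preprocessing",
    "preprocessing": {"file_1": "data/pbmc68k.h5ad"},
}


def upstream_modules(mapping: dict) -> dict:
    """Map each module to the set of modules it depends on (hypothetical)."""
    graph = {}
    for module, source in mapping.items():
        # A string naming another key in the mapping is a dependency;
        # a file mapping has no upstream module in this sketch.
        if isinstance(source, str) and source in mapping:
            graph[module] = {source}
        else:
            graph[module] = set()
    return graph


order = list(TopologicalSorter(upstream_modules(input_mapping)).static_order())
print(order)  # ['preprocessing', 'integration', 'metrics']

# Reusing one module's output for two downstream modules is still a valid
# graph: each module appears once as a key, but may be referenced many times.
reused = {
    "integration": "preprocessing",
    "metrics": "preprocessing",
    "preprocessing": {"file_1": "data/pbmc68k.h5ad"},
}
reused_order = list(TopologicalSorter(upstream_modules(reused)).static_order())
print(reused_order[0])  # preprocessing
```

Because the mapping is a dictionary, repeating a module name as a key would simply overwrite the earlier entry, which is why each module may appear only once.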
3. Module configuration
You can configure the behaviour of each module by specifying its parameters under the same task (dataset) name as the input mapping.
DATASETS:
  my_task:
    input:
      ...
    # module configuration
    preprocessing:
      highly_variable_genes:
        n_top_genes: 2000
      pca:
        n_comps: 50
      assemble:
        - normalize
        - highly_variable_genes
        - pca
    # module configuration
    integration:
      raw_counts: raw/X
      norm_counts: X
      batch: batch
      methods:
        unintegrated:
        scanorama:
          batch_size: 100
        scvi:
          max_epochs: 10
          early_stopping: true
    # module configuration
    metrics:
      unintegrated: layers/norm_counts
      batch: batch
      label: bulk_labels
      methods:
        - nmi
        - graph_connectivity
Each module has a specific set of parameters that can be configured. Read more about the specific parameters in the README of the module you want to use.
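In the integration example above, a method such as `unintegrated` is listed without a value. In YAML, a key with no value loads as null, which becomes `None` in Python. The following sketch shows one plausible way such an entry could be combined with per-method defaults. The `HYPOTHETICAL_DEFAULTS` values and the function `effective_params` are invented for illustration; the real defaults are defined by each method in the toolbox and may be handled differently.

```python
# Parsed form of the `methods` mapping from the integration module above.
# Keys written without a value (e.g. `unintegrated:`) load as None.
methods = {
    "unintegrated": None,
    "scanorama": {"batch_size": 100},
    "scvi": {"max_epochs": 10, "early_stopping": True},
}

# Hypothetical defaults, for illustration only.
HYPOTHETICAL_DEFAULTS = {
    "scvi": {"max_epochs": 400, "early_stopping": False},
}


def effective_params(method: str, user_params) -> dict:
    """Overlay user-supplied parameters on the method's defaults."""
    defaults = HYPOTHETICAL_DEFAULTS.get(method, {})
    # `user_params or {}` turns a missing value (None) into an empty dict,
    # so the method runs with defaults only.
    return {**defaults, **(user_params or {})}


for name, params in methods.items():
    print(name, effective_params(name, params))
```

User-supplied values take precedence over defaults in this sketch, so only the parameters you want to change need to be listed.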