🔧 Configure Your Workflow

Configuring your workflow means setting global options and defining subworkflows built from modules. Global options set output locations, computational resources, and other settings shared by all modules, while module settings control how a module behaves for a given task.

Note

The recommended way to manage your workflow configuration files is to save them outside of the toolbox directory, in a directory dedicated to your project. This keeps the toolbox and your own configuration cleanly separated.

You can find example configuration files under configs/.

TL;DR Complete configuration for quickstart
configs/quickstart.yaml
# =============================================================================
# Global settings
# =============================================================================

# Note: relative paths must be relative to the project root, not the directory of the config file.
output_dir: data/out   # large/intermediate files
images: images         # plots and smaller outputs

use_gpu: false
env_mode: local


# =============================================================================
# Tasks: define one or more named workflows, each with input and module config
# =============================================================================

DATASETS:

  my_task: # ---------- custom task name (rename as needed) ----------

    # -------------------------------------------------------------------------
    # Input: map each module to its input files or the output of a prior module
    # -------------------------------------------------------------------------
    input:
      preprocessing:
        file_1: data/pbmc68k.h5ad   # path to input file
      integration: preprocessing    # use all outputs of preprocessing as input
      metrics: integration          # use all outputs of integration as input

    # -------------------------------------------------------------------------
    # Module: preprocessing
    # -------------------------------------------------------------------------
    preprocessing:
      highly_variable_genes:
        n_top_genes: 2000
      pca:
        n_comps: 50
      assemble:             # ordered list of steps to run
        - normalize
        - highly_variable_genes
        - pca

    # -------------------------------------------------------------------------
    # Module: integration
    # -------------------------------------------------------------------------
    integration:
      raw_counts: layers/counts        # key for raw count matrix in the data object
      norm_counts: layers/normcounts   # key for normalized count matrix
      batch: batch                     # column in obs used as batch key
      methods:                         # list integration methods to run (omit value to use defaults)
        unintegrated:
        bbknn:
        combat:
        harmony_pytorch:
        harmonypy:
        scanorama:
          batch_size: 100
        scvi:
          max_epochs: 10
          early_stopping: true
        drvi:
          max_epochs: 10
          early_stopping: true
        sysvi:
          max_epochs: 10
          early_stopping: true
          system_key: phase

    # -------------------------------------------------------------------------
    # Module: metrics
    # -------------------------------------------------------------------------
    metrics:
      unintegrated: layers/normcounts  # baseline embedding to compare against
      batch: batch                     # column in obs used as batch key
      label: bulk_labels               # column in obs used as cell type label
      metrics:                         # list of metrics to compute
        - asw_batch
        - graph_connectivity
        - ilisi
        - pcr_comparison
        - kbet_pg
        - ari
        - asw_label
        - cell_cycle
        - clisi
        - isolated_label_asw
        - isolated_label_f1
        - nmi
        - morans_i_random
        - pcr_random
        - pcr
        - pcr_genes

1. Global configuration

These settings define your project’s output structure and hardware utilization. The directories will be created automatically if they do not already exist.

Output & Computational Settings

  • Output Paths: Intermediate and large files are stored under output_dir, while images and summary files are stored under images.

  • Hardware Acceleration: If you have a supported GPU, set use_gpu to true. Internally, this makes Snakemake use GPU-enabled conda environments for the rules that support them.

Note

Relative paths must be relative to the project root (where you call the pipeline), not the directory of the configuration file itself.

# Output locations
output_dir: data/out
images: images

# Hardware settings
use_gpu: true
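
The rule about relative paths can be illustrated with a minimal Python sketch. It is not the toolbox's own code: the function name resolve_config_path and the directory /home/user/my_project are hypothetical and only show which directory a relative value such as data/out is anchored to.

```python
from pathlib import PurePosixPath


def resolve_config_path(value: str, project_root: PurePosixPath) -> PurePosixPath:
    """Resolve a path from the config against the project root (illustrative helper)."""
    path = PurePosixPath(value)
    # Absolute paths are kept as they are; relative ones are anchored at the project root.
    return path if path.is_absolute() else project_root / path


# Hypothetical locations, for illustration only
project_root = PurePosixPath("/home/user/my_project")
config_file = project_root / "configs" / "quickstart.yaml"

output_dir = resolve_config_path("data/out", project_root)
# What you would get if paths were (incorrectly) read relative to the config file
wrong_guess = config_file.parent / "data/out"

print(output_dir)   # /home/user/my_project/data/out
print(wrong_guess)  # /home/user/my_project/configs/data/out
```

If outputs seem to be missing, checking whether they ended up under the second kind of location is a quick first step.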

2. Input configuration

You can select and combine modules to create a custom workflow by specifying the input and module configuration in a YAML file. Each workflow instance needs a user-defined task name, and its input section can list any number of modules.

Under each task, the input section lists module names, and each module is mapped to either its input files or the output of a previous module.

DATASETS:

  my_task:
    input:
      preprocessing:
        file_1: data/pbmc68k.h5ad
        # file_2: ... # more files if required
      integration: preprocessing # all outputs of module will automatically be used as input
      metrics: integration

  another_dataset:
    ...

See the documentation for more details on how to specify input mappings, which formats you can use, and how they are resolved to file ids for downstream modules.

Warning

Each module can appear only once as a key in the input mapping (internally, the mapping is a dictionary). You can, however, reuse the same module output as input for multiple other modules. The order of the entries in the input mapping doesn’t matter.
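
To see why the order of entries is irrelevant, it helps to treat the input mapping as a small dependency graph. The following Python sketch is an assumption about how such a mapping could be ordered, not the toolbox's actual resolution logic. The helper module_order is hypothetical, and it only handles the two forms shown above: a dictionary of files, or the name of another module.

```python
from graphlib import TopologicalSorter

# Input mapping of one task, written as the dictionary it becomes after parsing.
# The entries are deliberately listed "out of order".
input_mapping = {
    "metrics": "integration",
    "integration": "preprocessing",
    "preprocessing": {"file_1": "data/pbmc68k.h5ad"},
}


def module_order(mapping: dict) -> list[str]:
    """Return modules so that each one comes after the module it reads from."""
    graph = {}
    for module, source in mapping.items():
        # A string naming another module is a dependency; a dict of files is not.
        if isinstance(source, str) and source in mapping:
            graph[module] = {source}
        else:
            graph[module] = set()
    return list(TopologicalSorter(graph).static_order())


print(module_order(input_mapping))
# ['preprocessing', 'integration', 'metrics']
```

Because each module is a dictionary key, it can have only one entry, but nothing prevents two different keys from pointing at the same upstream module.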

3. Module configuration

You can configure the behaviour of each module by specifying its parameters under the same task name.

DATASETS:
  my_task:
    input:
      ...

    # module configuration
    preprocessing:
      highly_variable_genes:
        n_top_genes: 2000
      pca:
        n_comps: 50
      assemble:
        - normalize
        - highly_variable_genes
        - pca

    # module configuration
    integration:
      raw_counts: raw/X
      norm_counts: X
      batch: batch
      methods:
        unintegrated:
        scanorama:
          batch_size: 100
        scvi:
          max_epochs: 10
          early_stopping: true

    # module configuration
    metrics:
      unintegrated: layers/norm_counts
      batch: batch
      label: bulk_labels
      metrics:
        - nmi
        - graph_connectivity

Each module has a specific set of parameters that can be configured. Read more about the specific parameters in the README of the module you want to use.
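
Since the input mapping and the module configuration live side by side under one task, a quick consistency check can catch typos in module names. The sketch below is a hypothetical helper, not part of the toolbox. It assumes that every key other than input names a module, and it does not claim that the toolbox requires a configuration block for every module.

```python
def check_task(task: dict) -> dict:
    """Compare the modules listed under `input` with the configured modules."""
    in_input = set(task.get("input", {}))
    # Assumption: every key other than `input` names a module.
    configured = set(task) - {"input"}
    return {
        "input_without_config": sorted(in_input - configured),
        "config_without_input": sorted(configured - in_input),
    }


# A task as it would look after parsing the YAML; `metrics` has no config block here.
task = {
    "input": {
        "preprocessing": {"file_1": "data/pbmc68k.h5ad"},
        "integration": "preprocessing",
        "metrics": "integration",
    },
    "preprocessing": {"assemble": ["normalize", "highly_variable_genes", "pca"]},
    "integration": {"batch": "batch", "methods": {"unintegrated": None}},
}

print(check_task(task))
# {'input_without_config': ['metrics'], 'config_without_input': []}
```

An entry under config_without_input most likely points to a misspelled module name or a block that the workflow will never use.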