🔧 Configure Your Workflow

Configuring your workflow involves two parts: global options, and subworkflows made up of modules. Global configuration sets output locations, computational resources, and other settings shared by all modules, while module settings control the behaviour of a module for a given task.

Note

We recommend saving your workflow configuration files outside of the toolbox directory, in a directory dedicated to your project. This keeps the toolbox and your own configuration separate.

You can find example configuration files under configs/.

TL;DR Complete configuration for quickstart
configs/quickstart.yaml
  output_dir: data/out
  images: images
  use_gpu: false

  env_mode: local

  DATASETS:

    task_name: # custom task/workflow name

      # input specification: map of module name to map of input file name to input file path
      input:
        preprocessing:
          file_1: data/pbmc68k.h5ad
        integration: preprocessing
        metrics: integration
      
      # module configuration
      preprocessing:
        highly_variable_genes:
          n_top_genes: 2000
        pca:
          n_comps: 50
        assemble:
          - normalize
          - highly_variable_genes
          - pca
      
      # module configuration
      integration:
        raw_counts: layers/counts
        norm_counts: layers/normcounts
        batch: batch
        methods:
          unintegrated:
          bbknn:
          combat:
          harmony_pytorch:
          harmonypy:
          scanorama:
            batch_size: 100
          scvi:
            max_epochs: 10
            early_stopping: true
          drvi:
            max_epochs: 10
            early_stopping: true
          sysvi:
            max_epochs: 10
            early_stopping: true
            system_key: phase

      # module configuration
      metrics:
        unintegrated: layers/normcounts
        batch: batch
        label: bulk_labels
        metrics:
          - asw_batch
          - graph_connectivity
          - ilisi
          - pcr_comparison
          - ari
          - asw_label
          - cell_cycle
          - clisi
          - isolated_label_asw
          - isolated_label_f1
          - nmi
          - morans_i_random
          - kbet_pg
          - pcr_random
          - pcr
          - pcr_genes

1. Global configuration: Output settings

You can specify where the pipeline writes its output as follows. Intermediate and large files are stored under output_dir, while images and smaller files that help you interpret the results are stored under images. Relative paths are resolved against the directory from which you call the pipeline, not against the location of the config file. Both directories are created if they don’t yet exist.

# Note: relative paths are resolved from where you call the pipeline (e.g. the project root), not from the directory of the config file.
output_dir: data/out
images: images
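The path handling described above can be sketched in Python. This is only an illustration, not the toolbox's own code: the helper name `resolve_output_dirs` and its `workdir` argument are hypothetical, and it assumes that relative paths are resolved against the directory the pipeline is called from.

```python
from pathlib import Path


def resolve_output_dirs(config, workdir=None):
    """Resolve `output_dir` and `images` and create them if missing.

    Hypothetical helper for illustration; the toolbox may handle this differently.
    """
    # Assumption: relative paths are anchored at the calling directory
    workdir = Path(workdir) if workdir is not None else Path.cwd()
    resolved = {}
    for key in ("output_dir", "images"):
        path = Path(config[key])
        if not path.is_absolute():
            path = workdir / path
        # Mirrors "the directories are created if they don't yet exist"
        path.mkdir(parents=True, exist_ok=True)
        resolved[key] = path
    return resolved
```

Calling `resolve_output_dirs({"output_dir": "data/out", "images": "images"})` from your project directory would create `data/out` and `images` there, wherever the config file itself is stored.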

Another setting is the output file pattern map. By default, the final output of a rule follows the pattern <out_dir>/<module>/<wildcard>~{<wildcard>}/<more wildcards>.zarr. For some modules, the final output pattern differs from that default and must be specified explicitly in the output_map. In future, this shouldn’t be necessary.

output_map:
  sample_representation: data/out/sample_representation/dataset~{dataset}/file_id~{file_id}/pseudobulk.h5ad
  subset: data/out/subset/dataset~{dataset}/file_id~{file_id}/by_sample.zarr
  pca: data/out/preprocessing/dataset~{dataset}/file_id~{file_id}/pca.zarr
  neighbors: data/out/preprocessing/dataset~{dataset}/file_id~{file_id}/neighbors.zarr
  preprocessing: data/out/preprocessing/dataset~{dataset}/file_id~{file_id}/preprocessed.zarr
  metrics: data/out/metrics/results/per_dataset/{dataset}_metrics.tsv
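To make the pattern syntax concrete, here is a small Python sketch of how such a pattern could be built and filled in. It is not taken from the toolbox: `default_output_pattern` is a hypothetical helper that encodes one reading of the default pattern, and the wildcard values `task_name` and `file_1` are placeholders.

```python
def default_output_pattern(out_dir, module, wildcards):
    """Build a pattern with one `<name>~{<name>}` segment per wildcard.

    Assumed reading of the default pattern; hypothetical helper.
    """
    segments = "/".join(f"{name}~{{{name}}}" for name in wildcards)
    return f"{out_dir}/{module}/{segments}.zarr"


# The pattern keeps its `{wildcard}` placeholders until values are known
pattern = default_output_pattern("data/out", "integration", ["dataset", "file_id"])

# Filling the placeholders gives one concrete output path
path = pattern.format(dataset="task_name", file_id="file_1")

# An explicit output_map entry is filled in the same way
metrics_pattern = "data/out/metrics/results/per_dataset/{dataset}_metrics.tsv"
metrics_path = metrics_pattern.format(dataset="task_name")
```

Here `pattern` is `data/out/integration/dataset~{dataset}/file_id~{file_id}.zarr`, and `path` is the same string with the placeholders replaced by `task_name` and `file_1`.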

The default output settings under configs/outputs.yaml should work out of the box.

2. Global configuration: Computational settings

You can configure the workflow to make use of the hardware you have available. If you have a GPU, set use_gpu to true and the pipeline will try to use the GPU for all modules that support it. The same applies if you have an Intel CPU. In the backend, this setting determines which conda environment Snakemake uses whenever a rule specifies hardware-accelerated environments.

env_mode: local
use_gpu: true
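The effect of use_gpu on environment selection can be pictured roughly as follows. This is a conceptual sketch only: the function `select_env`, the `"cpu"`/`"gpu"` mapping layout, and the environment names are all hypothetical and do not reflect how the toolbox or Snakemake actually store this information.

```python
def select_env(rule_envs, use_gpu):
    """Pick a conda environment name for one rule.

    `rule_envs` maps a hardware flavour ("cpu", "gpu") to an environment name.
    Both the layout and the names are assumptions made for illustration.
    """
    # Use the accelerated environment only if the rule specifies one
    if use_gpu and "gpu" in rule_envs:
        return rule_envs["gpu"]
    return rule_envs["cpu"]


# A rule with a GPU environment switches when use_gpu is true
accelerated = select_env({"cpu": "scvi", "gpu": "scvi_gpu"}, use_gpu=True)

# A rule without one keeps its default environment
fallback = select_env({"cpu": "scanpy"}, use_gpu=True)
```

The point of the sketch is that use_gpu is a global switch, while the decision is still made per rule: modules without a hardware-accelerated environment are unaffected.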

3. Input configuration

You can select and combine modules to create a custom workflow by specifying the input and module configuration in a YAML file. Each instance of a workflow needs a unique task name. Its input mapping can contain any number of modules, each of which takes its input either from files or from the output of another module.

DATASETS: # TODO: rename to TASKS

  task_name: # custom task/workflow name
    # input specification: map of module name to map of input file name to input file path
    input:
      preprocessing:
        file_1: data/pbmc68k.h5ad
        # file_2: ... # more files if required
      integration: preprocessing # all outputs of module will automatically be used as input
      metrics: integration

  another_dataset:
    ...

Warning

There can only be one instance of a module as a key in the input mapping (in the backend this is a dictionary). But you can reuse the same module output as input for multiple other modules. The order of the entries in the input mapping doesn’t matter.
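The following Python sketch shows why the order of entries doesn't matter: the dependencies between modules can be recovered from the mapping itself. It is an illustration rather than the toolbox's implementation. The function `module_order` is hypothetical, and it assumes that a string value names another module in the same mapping while a nested mapping lists input files.

```python
def module_order(input_map):
    """Return modules ordered so each comes after the module it reads from.

    Hypothetical helper; assumes a string value refers to another module key.
    """
    order = []
    visiting = set()

    def visit(module):
        if module in order:
            return
        if module in visiting:
            raise ValueError(f"circular input reference at {module!r}")
        visiting.add(module)
        source = input_map[module]
        # Follow the reference only if it points at another module
        if isinstance(source, str) and source in input_map:
            visit(source)
        visiting.discard(module)
        order.append(module)

    for module in input_map:
        visit(module)
    return order


# Entries deliberately listed "backwards"
scrambled = {
    "metrics": "integration",
    "integration": "preprocessing",
    "preprocessing": {"file_1": "data/pbmc68k.h5ad"},
}
order = module_order(scrambled)

# One module's output reused as input for two other modules
shared = {
    "preprocessing": {"file_1": "data/pbmc68k.h5ad"},
    "integration": "preprocessing",
    "metrics": "preprocessing",
}
shared_order = module_order(shared)
```

Because the mapping is a dictionary, each module can appear as a key only once, but nothing prevents several keys from pointing at the same upstream module, as in `shared`.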

4. Module configuration

You can configure the behaviour of each module by specifying its parameters under the same task name as the input mapping.

DATASETS:
  task_name:
    input:
      ...

    # module configuration
    preprocessing:
      highly_variable_genes:
        n_top_genes: 2000
      pca:
        n_comps: 50
      assemble:
        - normalize
        - highly_variable_genes
        - pca

    # module configuration
    integration:
      raw_counts: raw/X
      norm_counts: X
      batch: batch
      methods:
        unintegrated:
        scanorama:
          batch_size: 100
        scvi:
          max_epochs: 10
          early_stopping: true

    # module configuration
    metrics:
      unintegrated: layers/norm_counts
      batch: batch
      label: bulk_labels
      metrics:
        - nmi
        - graph_connectivity

Each module has a specific set of parameters that can be configured. Read more about the specific parameters in the README of the module you want to use.
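One detail of the examples above is worth spelling out: a method listed with no value, such as `unintegrated:`, is parsed by YAML as null, which becomes `None` in Python. The sketch below shows one way to normalise such a mapping so that every method has a parameter dictionary. The helper `method_params` is hypothetical and not part of the toolbox. The dictionary literal stands in for the parsed `methods` section.

```python
def method_params(methods):
    """Map each method name to a dict of parameters.

    Entries without parameters (parsed as None) become empty dicts.
    Hypothetical helper for illustration.
    """
    return {name: dict(params or {}) for name, params in methods.items()}


# What the `methods` section of the integration module looks like once parsed
methods = {
    "unintegrated": None,
    "scanorama": {"batch_size": 100},
    "scvi": {"max_epochs": 10, "early_stopping": True},
}
params = method_params(methods)
```

After normalisation, `params["unintegrated"]` is an empty dictionary, so downstream code can treat methods with and without parameters in the same way.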