▶️ Call the pipeline

Create a runner script#

Next, you can create a runner script that will call the pipeline with the correct profile and configuration file(s). This way, it is easier to call the pipeline and you can avoid having to remember all the flags and options.

Below is an example of a runner script that you can use to call the pipeline.

run_example.sh#

#!/usr/bin/env bash
set -e -x

snakemake \
  --profile .profiles/local \
  --configfile configs/quickstart.yaml \
  --snakefile workflow/Snakefile \
    "$@"

You must set the flag --use-conda to ensure that Snakemake uses the conda environments specified in the rules. You can add this to the runner script or, as done in the quickstart example with --profile .profiles/local, to a Snakemake profile (see Snakemake profiles).
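As a rough sketch of the profile route, the snippet below writes a minimal profile that switches on conda. The directory `.profiles/example` is a placeholder chosen so that the quickstart's `.profiles/local` is left untouched, and it assumes that profile keys mirror Snakemake's long command-line options without the leading dashes.

```shell
# Hypothetical profile directory; chosen so that .profiles/local is not overwritten.
profile_dir=".profiles/example"
mkdir -p "$profile_dir"

# Profile keys mirror Snakemake's long options without the leading dashes,
# so this entry corresponds to passing --use-conda on the command line.
cat > "$profile_dir/config.yaml" <<'EOF'
use-conda: true
EOF

# Show the resulting profile.
cat "$profile_dir/config.yaml"
```

You could then point the runner script at it with `--profile .profiles/example`.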

If your config file becomes very large, you can split it into separate config files, for example one per workflow, and pass all of them to --configfile in the runner script.

snakemake \
  --configfile \
    config1.yaml \
    config2.yaml \
    config3.yaml \
  ...
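
One way to keep a longer list manageable is to collect the config files in a Bash array. The following is only a sketch: the three file names are placeholders, and `build_cmd` is a hypothetical helper that prints the command instead of running it. It places `--configfile` before another option, on the assumption that anything following the list up to the next option is read as a further config file, so targets should not come directly after it.

```shell
# Placeholder config file names; replace them with your own.
config_files=(
  configs/preprocessing.yaml
  configs/integration.yaml
  configs/metrics.yaml
)

# Hypothetical helper: print the full command as shell-quoted words.
build_cmd() {
  local cmd=(
    snakemake
    --profile .profiles/local
    --configfile "${config_files[@]}"
    --snakefile workflow/Snakefile
    "$@"
  )
  printf '%q ' "${cmd[@]}"
  printf '\n'
}

# Print the command for a dry run of one target.
build_cmd integration_all -n
```

In a real runner script you would replace the two `printf` lines with `"${cmd[@]}"` to execute the command.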

Tip

Check out the Snakemake documentation for more command-line arguments.

Call the Snakemake pipeline#

Before running the pipeline, you need to activate your Snakemake environment.

conda activate snakemake
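
If you are unsure whether the activation worked, a quick check along these lines can help. This is a minimal sketch: `require_tool` is a hypothetical helper and not part of the pipeline.

```shell
# Hypothetical helper: succeed only if the given command is on the PATH.
require_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    return 0
  fi
  echo "'$1' not found; did you activate the right conda environment?" >&2
  return 1
}

# Report the Snakemake version if it is available.
if require_tool snakemake; then
  snakemake --version
fi
```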

First dry run#

When you execute the script (here saved as run_pipeline.sh), you can treat it like a snakemake command and append any additional Snakemake arguments you want to use. A dry run would be:

bash run_pipeline.sh -n

This will show you what Snakemake wants to run. Without specifying any rule, the default rules that the pipeline will request are common_dag and common_rulegraph. You can ignore these for now.

...
Building DAG of jobs...
Job stats:
job                 count
----------------  -------
all                     1
common_dag              1
common_rulegraph        1
total                   3

List all available rules#

The pipeline will only run the target that you explicitly tell it to run. A target can be either the name of a Snakemake rule or a file that can be generated by Snakemake (as defined by the Snakefiles). You can list all possible rules with:

bash run_pipeline.sh -l

Which should give you something like this:

all
batch_analysis_all
batch_analysis_batch_pcr
batch_analysis_collect
batch_analysis_dependency_graph
batch_analysis_determine_covariates
batch_analysis_plot
clustering_all
clustering_cluster
clustering_compute_neighbors
clustering_compute_umap
clustering_dependency_graph
clustering_merge
...
split_data_all
split_data_dependency_graph
split_data_link
split_data_split
subset_all
subset_dependency_graph
subset_subset

All the rules ending with _all are callable, i.e. you can use them to specify that their workflow should be run. The remaining rules are needed by the pipeline internally and are not meant to be called directly, so you can ignore them.
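
To see only the callable targets, you could filter the rule list for names ending in `_all`. A minimal sketch, assuming the runner script is called run_pipeline.sh as above; `callable_targets` is a hypothetical helper name.

```shell
# Hypothetical helper: keep only rule names that end in "_all" (reads stdin).
callable_targets() {
  grep -E '_all$' || true
}

# Demonstration on a few rule names taken from the listing above.
printf '%s\n' clustering_all clustering_cluster clustering_merge subset_all \
  | callable_targets

# With the real pipeline (requires an active Snakemake environment):
#   bash run_pipeline.sh -l | callable_targets
```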

Specify which workflow/rule you want to run#

Given the config above, you can call the integration workflow by specifying the integration_all target:

bash run_pipeline.sh integration_all -n

This should list all the rules with details such as inputs, outputs and parameters, as well as the following summary:

...

Job stats:
job                                    count
-----------------------------------  -------
integration_all                            1
integration_barplot_per_dataset            3
integration_benchmark_per_dataset          1
integration_compute_umap                   6
integration_plot_umap                      6
integration_postprocess                    6
integration_prepare                        1
integration_run_method                     3
preprocessing_assemble                     1
preprocessing_highly_variable_genes        1
preprocessing_normalize                    1
preprocessing_pca                          1
total                                     31

Reasons:
    (check individual jobs above for details)
    input files updated by another job:
        integration_all, integration_barplot_per_dataset, integration_benchmark_per_dataset, integration_compute_umap, integration_plot_umap, integration_postprocess, integration_prepare, integration_run_method, preprocessing_assemble, preprocessing_highly_variable_genes, preprocessing_pca
    missing output files:
        integration_benchmark_per_dataset, integration_compute_umap, integration_postprocess, integration_prepare, integration_run_method, preprocessing_assemble, preprocessing_highly_variable_genes, preprocessing_normalize, preprocessing_pca

This was a dry-run (flag -n). The order of jobs does not reflect the order of execution.

Notice that this also includes preprocessing jobs that are needed as input to the integration module. Since the command only specified the integration target, Snakemake resolves only the preprocessing jobs that the integration module depends on. If you want to include all preprocessing jobs (e.g. UMAPs for each input), you need to add preprocessing_all to the command:

bash run_pipeline.sh preprocessing_all integration_all -n

Following the same principle, you can run the metrics by adding the metrics_all rule to the target list:

bash run_pipeline.sh preprocessing_all integration_all metrics_all -n

Execute the workflow#

If you are happy with the dry run, you can execute the workflow by specifying the number of cores you want to provide for the pipeline.

bash run_pipeline.sh preprocessing_all integration_all metrics_all -c 10
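
If you want to make the dry run a fixed part of your routine, a small wrapper can run it before every real execution. This is a sketch, not part of the pipeline: `run_checked`, `RUNNER` and `CORES` are hypothetical names, and the runner is assumed to be the run_pipeline.sh script from above.

```shell
# Assumed names: adjust the runner path and the core count to your setup.
RUNNER="${RUNNER:-run_pipeline.sh}"
CORES="${CORES:-10}"

# Hypothetical helper: dry-run the given targets, then execute them.
run_checked() {
  if ! bash "$RUNNER" "$@" -n; then
    echo "Dry run failed; nothing was executed." >&2
    return 1
  fi
  bash "$RUNNER" "$@" -c "$CORES"
}

# Usage:
#   run_checked preprocessing_all integration_all metrics_all
```

The real run only starts if the dry run exits successfully, so a misspelled target or a broken config is caught before any job is launched.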

The jobs will be executed locally, on the machine where you call the command, but you can also configure Snakemake profiles for cluster execution. Check out Cluster execution for an example profile for SLURM.

Note

You have now successfully set up and configured your pipeline! Give it a spin and feel free to edit the configs to your custom workflow! 🎉