.. _call-the-pipeline:

▶️ Call the pipeline
====================

Create a runner script
----------------------
Next, you can create a runner script that will call the pipeline with the correct profile and configuration file(s).
This way, it is easier to call the pipeline and you can avoid having to remember all the flags and options.

Below is an example of a runner script that you can use to call the pipeline.

.. literalinclude:: ../../run_example.sh
    :language: bash
    :caption: run_example.sh

You must set the flag ``--use-conda`` to ensure that Snakemake uses the conda environments specified in the rules.
You can add this to the runner script or, as done in the quickstart example with ``--profile .profiles/local``, to
a Snakemake profile (see :ref:`snakemake_profiles`).

.. dropdown:: Example profile configuration for local computation
    :icon: beaker

    .. literalinclude:: ../../.profiles/local/config.yaml
        :language: yaml
        :caption: .profiles/local/config.yaml

If your config file becomes very large, you can split it into separate config files, for example one per workflow, and pass them all to ``--configfile`` in the runner script.

.. code-block:: bash

   snakemake \
     --configfile \
       config1.yaml \
       config2.yaml \
       config3.yaml \
     ...
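
If the list of config files grows, one way to keep it readable is to hold it in a Bash array inside the runner script.
The following is a minimal sketch, not the pipeline's actual runner: the file names are placeholders, the function name ``build_command`` is made up for illustration, and the function only prints the assembled command so that it can be inspected without Snakemake installed.
According to the Snakemake command-line documentation, multiple config files overwrite each other in the given order, so the order of the array matters.

.. code-block:: bash

   # Placeholder file names; replace them with your own config files.
   config_files=(config1.yaml config2.yaml config3.yaml)

   # Print the snakemake call that a runner script would execute.
   # --configfile accepts several files, so the caller's arguments
   # (targets, -n, -c ...) are placed before it in this sketch.
   build_command() {
     local cmd=(snakemake "$@" --configfile "${config_files[@]}")
     echo "${cmd[*]}"
   }

   build_command integration_all -n
   # prints: snakemake integration_all -n --configfile config1.yaml config2.yaml config3.yaml

A real runner would execute ``"${cmd[@]}"`` instead of printing it.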

.. tip::
   Check out the `Snakemake documentation <https://snakemake.readthedocs.io/en/v7.31.1/executing/cli.html>`_ for more command-line arguments.


Call the Snakemake pipeline
---------------------------

Before running the pipeline, you need to activate your Snakemake environment.

.. code-block:: bash

   conda activate snakemake

.. dropdown:: How does Snakemake work?
    :icon: info

    The general command for running a pipeline is:

    .. code-block:: bash

       snakemake <snakemake args>

    The most relevant snakemake arguments are:

    - ``-n``: dry run
    - ``--use-conda``: use rule-specific conda environments to ensure all dependencies are met
    - ``-c``: maximum number of cores to be used
    - ``--configfile``: specify a config file to use. The overall workflow already defaults to the config file under ``configs/config.yaml``

    For more information on how Snakemake works, please refer to `Snakemake's extensive documentation <https://snakemake.readthedocs.io/en/v7.31.1/index.html>`_.



First dry run
-------------

When you execute the script (here called ``run_pipeline.sh``), you can treat it like a ``snakemake`` command and append any additional Snakemake arguments.
A dry run would be:

.. code-block:: bash

   bash run_pipeline.sh -n

This will show you what Snakemake wants to run.
If you don't specify a rule, the pipeline requests the default rules ``common_dag`` and ``common_rulegraph``.
You can ignore these for now.

.. code-block:: text

   ...
   Building DAG of jobs...
   Job stats:
   job                 count
   ----------------  -------
   all                     1
   common_dag              1
   common_rulegraph        1
   total                   3
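
Extra arguments such as ``-n`` reach Snakemake because a runner script can forward them with ``"$@"``.
The sketch below illustrates the idea under assumptions: it is not the contents of ``run_example.sh``, the function name ``run_pipeline`` is made up for illustration, and it prints the command instead of executing it so that it runs without Snakemake installed.

.. code-block:: bash

   # Hypothetical stand-in for a runner script: fixed options are written
   # once, and "$@" appends whatever the caller typed, unchanged.
   run_pipeline() {
     # A real runner would call snakemake here instead of echo.
     echo snakemake --profile .profiles/local "$@"
   }

   run_pipeline -n                  # prints: snakemake --profile .profiles/local -n
   run_pipeline integration_all -n  # prints: snakemake --profile .profiles/local integration_all -n

Quoting ``"$@"`` keeps each argument intact, so targets or paths that contain spaces are passed through as single arguments.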

List all available rules
------------------------

The pipeline will only run the target that you explicitly tell it to run.
A target can be either the name of a Snakemake rule or a file that can be generated by Snakemake (as defined by the Snakefiles).
You can list all possible rules with:

.. code-block:: bash

   bash run_pipeline.sh -l

This should give you something like:

.. code-block:: text

   all
   batch_analysis_all
   batch_analysis_batch_pcr
   batch_analysis_collect
   batch_analysis_dependency_graph
   batch_analysis_determine_covariates
   batch_analysis_plot
   clustering_all
   clustering_cluster
   clustering_compute_neighbors
   clustering_compute_umap
   clustering_dependency_graph
   clustering_merge
   ...
   split_data_all
   split_data_dependency_graph
   split_data_link
   split_data_split
   subset_all
   subset_dependency_graph
   subset_subset

All rules ending with ``_all`` are callable, i.e. you can use them to request that their workflow be run.
The remaining rules are needed by the pipeline but are not meant to be called directly, so you can ignore them.
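
To see only the callable targets, the rule list can be filtered for names that are ``all`` or end in ``_all``.
This is a small sketch built on the naming convention described above; the function name ``callable_targets`` is made up for illustration, and the sample input stands in for real ``-l`` output.

.. code-block:: bash

   # Keep only rule names that are "all" or end in "_all".
   # awk compares the first field, so trailing spaces or any text
   # after the rule name are ignored.
   callable_targets() {
     awk '$1 ~ /(^|_)all$/ { print $1 }'
   }

   # Sample input taken from the listing above.
   printf '%s\n' 'all' 'clustering_all' 'clustering_merge' 'subset_subset' | callable_targets

With the runner script, the same filter would be applied as ``bash run_pipeline.sh -l | callable_targets``.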


Specify which workflow/rule you want to run
-------------------------------------------

Given the config above, you can call the integration workflow by specifying the ``integration_all`` target:

.. code-block:: bash

   bash run_pipeline.sh integration_all -n

This should list all the rules with details such as inputs, outputs and parameters, as well as the following summary:

.. code-block:: text

   ...

   Job stats:
   job                                    count
   -----------------------------------  -------
   integration_all                            1
   integration_barplot_per_dataset            3
   integration_benchmark_per_dataset          1
   integration_compute_umap                   6
   integration_plot_umap                      6
   integration_postprocess                    6
   integration_prepare                        1
   integration_run_method                     3
   preprocessing_assemble                     1
   preprocessing_highly_variable_genes        1
   preprocessing_normalize                    1
   preprocessing_pca                          1
   total                                     31

   Reasons:
       (check individual jobs above for details)
       input files updated by another job:
           integration_all, integration_barplot_per_dataset, integration_benchmark_per_dataset, integration_compute_umap, integration_plot_umap, integration_postprocess, integration_prepare, integration_run_method, preprocessing_assemble, preprocessing_highly_variable_genes, preprocessing_pca
       missing output files:
           integration_benchmark_per_dataset, integration_compute_umap, integration_postprocess, integration_prepare, integration_run_method, preprocessing_assemble, preprocessing_highly_variable_genes, preprocessing_normalize, preprocessing_pca

   This was a dry-run (flag -n). The order of jobs does not reflect the order of execution.

Notice that this also includes ``preprocessing`` jobs that are needed as input to the ``integration`` module.
Because the command only requested the integration output, Snakemake resolves only the preprocessing jobs that the integration module requires.
If you want to include all preprocessing jobs (e.g. UMAPs for each input), you need to add ``preprocessing_all`` to the command:

.. code-block:: bash

   bash run_pipeline.sh preprocessing_all integration_all -n

Following the same principle, you can compute the metrics by adding the ``metrics_all`` rule to the target list:

.. code-block:: bash

   bash run_pipeline.sh preprocessing_all integration_all metrics_all -n


Execute the workflow
--------------------

If you are happy with the dry run, you can execute the workflow by specifying the number of cores you want to provide for the pipeline.

.. code-block:: bash

   bash run_pipeline.sh preprocessing_all integration_all metrics_all -c 10

The workflow is executed locally, on the machine where you call the command, but you can also configure :ref:`snakemake_profiles` for cluster execution.
Check out :ref:`cluster_execution` for an example profile for SLURM execution.

.. note::
   You have now successfully set up and configured your pipeline!
   Give it a spin and feel free to edit the configs to your custom workflow! 🎉