Clustering#
Clustering#
This module provides a workflow to perform clustering at multiple resolutions and at multiple subclustering levels (iterative sub-clustering). The code is optimised for memory efficiency and speed, allowing for clustering of large datasets.
Configuration#
Example configuration for clustering can be found in workflow/clustering/config.yaml showcasing the setup for hierarchical (iterative) sub-clustering.
DATASETS:
test_level1_multiresolution:
input:
clustering: test/input/pbmc68k.h5ad
clustering:
recompute_umap: true
recompute_neighbors: true
neighbors:
n_neighbors: 30
use_rep: X_pca
algorithm:
- leiden
# - louvain
resolutions:
- 0.2
- 0.3
- 0.4
umap_colors:
- bulk_labels
- batch
test_hierarchical:
input:
clustering: test/input/pbmc68k.h5ad
clustering:
neighbors_key: neighbors
algorithm: leiden
resolutions:
- 0.5
hierarchy:
1: 0.1
3: 0.2
umap_colors:
- bulk_labels
- batch
- n_genes
test_hierarchical_int:
input:
clustering: test/input/pbmc68k.h5ad
clustering:
neighbors_key: neighbors
algorithm: leiden
use_gpu: true
resolutions:
- 0.5
hierarchy: 3
umap_colors:
- bulk_labels
- batch
Parameters#
Clustering Parameters#
algorithm: Clustering algorithm to useSingle algorithm:
leidenorlouvainMultiple algorithms: List of algorithms (e.g.,
[leiden, louvain])
resolutions: List of clustering resolutions to testExample:
[0.2, 0.4, 0.8, 1.0, 1.6]
neighbors_key: Key for pre-computed neighbors inadata.unsDefault: Uses existing neighbors if available
hierarchy: Configuration for iterative sub-clusteringDictionary format:
{cluster_id: resolution}(e.g.,{1: 0.1, 3: 0.2})Integer format: Maximum hierarchy depth (e.g.,
3)
Neighbor Computation#
recompute_neighbors: Boolean flag to recompute neighborstrue: Force recomputation of neighborsfalse(default): Use existing neighbors if available
neighbors: Parameters for neighbor computation when recomputingn_neighbors: Number of neighbors (e.g.,30)use_rep: Representation to use (e.g.,X_pca)
UMAP Configuration#
recompute_umap: Boolean flag to recompute UMAP embeddingtrue: Force recomputation of UMAPfalse(default): Use existing UMAP if available
umap_colors: List of metadata columns to use for UMAP coloringExample:
[bulk_labels, batch, n_genes]
Performance Options#
use_gpu: Boolean flag to enable GPU accelerationtrue: Use GPU for clustering when availablefalse(default): Use CPU only
n_cell_cpu: Threshold for forcing CPU computation - CPU is used when dataset has fewer than this many cells (default: 100000)
Calling the clustering script from the command line#
The clustering script is located at workflow/clustering/scripts/clustering.py and performs single-level clustering operations. For hierarchical clustering, you need to run the script multiple times in sequence.
The script automatically detects GPU availability and uses RAPIDS when possible for accelerated clustering. It falls back to scanpy implementation when needed.
Make sure you are using the envs/scanpy.yaml environment, which has all the necessary dependencies installed.
Check the help message for the script to see all available options:
python workflow/clustering/scripts/clustering.py -h
usage: clustering.py [-h] [--resolution RESOLUTION] [--algorithm {louvain,leiden}] [--level LEVEL] [--threads THREADS] [--overwrite] [--max_cluster_factor MAX_CLUSTER_FACTOR] [--clustering_args CLUSTERING_ARGS] [--neighbors_key NEIGHBORS_KEY] [--neighbors_args NEIGHBORS_ARGS] [--n_cell_cpu N_CELL_CPU]
input_file output_file
Run clustering on anndata file.
positional arguments:
input_file Input anndata file (.h5ad or .zarr)
output_file Output anndata file (.zarr)
options:
-h, --help show this help message and exit
--resolution RESOLUTION
Clustering resolution (default: 1.0)
--algorithm {louvain,leiden}
Clustering algorithm (default: leiden)
--level LEVEL Hierarchical clustering level (default: 1)
--threads THREADS Number of threads for parallel processing (default: 1)
--overwrite Overwrite existing clustering results
--max_cluster_factor MAX_CLUSTER_FACTOR
Maximum cluster factor for heuristic check (default: 50)
--clustering_args CLUSTERING_ARGS
Additional clustering arguments as JSON string
--neighbors_key NEIGHBORS_KEY
Key for neighbors in adata.uns (default: neighbors)
--neighbors_args NEIGHBORS_ARGS
Arguments for neighbors computation as JSON string
--n_cell_cpu N_CELL_CPU
Force CPU when dataset has fewer than this many cells (default: 100000)
Script Behavior#
Level 1 Clustering#
For --level 1, the script:
Reads the input file with existing neighbors graph
Applies clustering using the specified algorithm and resolution
Outputs a zarr file containing only the clustering results in
.obs
Hierarchical Clustering (Level > 1)#
For --level > 1, the script:
Reads clustering results from the previous level
Recomputes neighbors for each cluster separately using parallel processing
Applies sub-clustering within each cluster
Combines results with hierarchical naming (e.g.,
cluster_subcluster)
GPU/CPU Handling#
Automatically detects NVIDIA GPU availability
Uses RAPIDS implementation when GPU is available and beneficial
Falls back to scanpy for small datasets (<
n_cell_cpucells)Includes heuristic checks for clustering quality with GPU fallback
Output Format#
Creates zarr-linked output preserving the original data structure
Only writes the new clustering column to minimize storage
Clustering results are named as
{algorithm}_{resolution}_{level}
Examples#
Basic clustering at level 1:#
python workflow/clustering/scripts/clustering.py \
input.h5ad \
output_level1.zarr \
--resolution 1.0 \
--algorithm leiden
Hierarchical clustering at level 2:#
# First run level 1
python workflow/clustering/scripts/clustering.py \
input.h5ad \
output_level1.zarr \
--resolution 1.0 \
--level 1
# Then run level 2 (requires level 1 results)
python workflow/clustering/scripts/clustering.py \
output_level1.zarr \
output_level2.zarr \
--resolution 0.5 \
--level 2 \
--threads 4 \
--neighbors_args '{"n_neighbors": 15, "use_rep": "X_pca"}'
Using custom clustering parameters:#
python workflow/clustering/scripts/clustering.py \
input.h5ad \
output.zarr \
--resolution 0.8 \
--clustering_args '{"n_iterations": 5, "flavor": "igraph"}' \
--overwrite
Output Structure#
Each run produces a single clustering result column:
Level 1:
leiden_1.0_1(algorithm_resolution_level)Level 2:
leiden_1.0_2(inherits resolution from level 1)
To collect multiple clustering results, read and combine the .obs columns:
import pandas as pd
from utils.io import read_anndata
# Combine results from multiple runs
obs_data = []
for file in ['output_level1.zarr', 'output_level2.zarr']:
obs = read_anndata(file, obs='obs', verbose=False).obs
obs_data.append(obs)
combined_obs = pd.concat(obs_data, axis=1)