Development Guide
This guide walks you through contributing to the sc-preprocess pipeline. Whether you’re adding a new analysis method or an entirely new single-cell preprocessing step, this document provides the workflow, patterns, and testing strategies you need.
Getting Started
Follow the Installation instructions (steps 1-5), using pip install -e . for an editable install.
After pulling new changes from the repository:
git pull
pip install -e .
Building the documentation
To build and preview the docs locally, install the documentation dependencies:
pip install -e ".[docs]"
Then serve with auto-reload:
sphinx-autobuild docs/source docs/build/html
Updating dependencies
The project uses three types of environment files:
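| File | Purpose | Editable? |
|---|---|---|
| environment.yaml | Minimal install spec (hand-maintained) | Yes: edit this when adding/changing deps |
| environment_lock_linux.yaml | Exact pins for Linux reproducibility | No: regenerate with conda env export |
| environment_lock_portable.yaml | Exact pins without build hashes | No: regenerate with conda env export --no-builds |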
To add a new dependency:
Do NOT run conda env export > environment.yaml: this overwrites the minimal spec with a 240+ line platform-specific dump.
Add it to environment.yaml (conda section or pip section).
Recreate your environment:
conda env remove -n snakemake8 && conda env create -f environment.yaml
Optionally regenerate lock files:
conda env export | grep -v "^prefix:" > environment_lock_linux.yaml
conda env export --no-builds | grep -v "^prefix:" > environment_lock_portable.yaml
Per-rule conda environments: If you add a tool with dependency conflicts that cannot be resolved in the main environment, Snakemake supports isolated per-rule conda environments via the conda: directive. Place a .yaml file in workflows/envs/ and reference it in the rule:
rule my_rule:
    ...
    conda:
        "../envs/my_tool.yaml"
    script:
        "../scripts/my_script.py"
Note for RHEL8/HPC users: Snakemake’s conda: directive activates isolated environments in a subprocess that does not inherit LD_LIBRARY_PATH. On systems where the system libstdc++ is older than GLIBCXX_3.4.29 (e.g., RHEL8), this causes import failures for any modern conda-forge package. In that case, add the tool to environment.yaml instead.
Key Concepts
Preprocessing Step: A preprocessing stage in single-cell analysis, such as demultiplexing or doublet detection. Each step has its own Snakemake rule file (workflows/rules/<step>.smk) and Pydantic schema (schemas/<step>.py), so that several tools solving the same step (e.g., cellsnp-lite + vireo, or demuxalot) can be offered side by side.
Method: A software tool that implements a preprocessing step. For example, the demultiplexing step supports two methods: vireo and demuxalot. Adding a new method means integrating another tool into an existing preprocessing step.
Developer Workflow
Follow these steps when making changes:
Orient - Understand the project architecture and how the pipeline resolves what to run
Develop - Choose your task: add a method or add a step
Test - Validate with DAG checks and integration tests
Document - Update documentation!
Project Architecture
Understanding the file layout is essential before making changes:
sc_preprocess/
├── cli.py # CLI entry points
├── config_generator.py # Interactive config builder
├── config_validator.py # PIPELINE_DIRECTORIES, validation
├── schemas/ # Pydantic models for config validation
│ ├── base.py # BaseStepConfig (all steps inherit this)
│ ├── config.py # PipelineConfig (unified schema)
│ ├── cellranger.py # GEX/ATAC/ARC Cell Ranger configs
│ ├── demultiplexing.py # DemuxalotConfig, VireoConfig
│ └── doublet_detection.py # ScrubletConfig
├── workflows/
│ ├── main.smk # Master workflow, rule all, includes
│ ├── rules/ # One .smk file per pipeline step
│ │ ├── cellranger.smk # Cell Ranger count/aggregation
│ │ ├── object_creation.smk # Per-capture AnnData/MuData creation
│ │ ├── batch_aggregation.smk # Batch-level aggregation + metadata enrichment
│ │ ├── demultiplexing.smk # Demuxalot/Vireo
│ │ └── doublet_detection.smk # Scrublet
│ └── scripts/
│ ├── build_targets.py # Generates target files for rule all
│ ├── parse_config.py # Extracts enabled steps, methods, etc.
│ ├── create_gex_anndata.py # GEX per-capture object creation
│ ├── create_atac_anndata.py # ATAC per-capture object creation
│ ├── create_arc_mudata.py # ARC per-capture MuData creation
│ ├── aggregate_batch.py # Batch aggregation (per-capture → batch)
│ ├── merge_metadata.py # Merge analysis metadata into batch objects
│ └── run_scrublet.py # Scrublet doublet detection
tests/
├── test.sh # Integration test script
├── 00_TEST_DATA_GEX/ # Test configs and library lists
└── ...
docs/source/ # Read the Docs documentation (this file)
How the pipeline resolves what to run
Config (pipeline_config.yaml) declares which steps are enabled: true
parse_config.py → get_enabled_steps() reads the config and returns a list of enabled step names
main.smk conditionally includes .smk rule files based on enabled steps
build_targets.py → build_all_targets() generates the list of expected .done files for rule all
Each rule produces a .done marker file in {output_dir}/00_LOGS/ that matches what build_targets.py expects
If the target filename from build_targets.py doesn’t match the done output in the rule, Snakemake will raise a MissingInputException.
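The first steps of this resolution flow can be sketched in plain Python. This is only an illustration under stated assumptions, not the pipeline's actual code: the config layout (one top-level block per step carrying an enabled flag) and the helper rule_files_to_include are hypothetical, and the real get_enabled_steps() in parse_config.py may work differently.

```python
def get_enabled_steps(config):
    """Return names of steps whose config block sets enabled: true (assumed layout)."""
    return [
        name
        for name, block in config.items()
        if isinstance(block, dict) and block.get("enabled") is True
    ]


def rule_files_to_include(enabled_steps):
    """Mimic main.smk's conditional includes: one rules/<step>.smk per enabled step.

    Hypothetical helper; main.smk itself uses Snakemake include: statements.
    """
    return [f"rules/{step}.smk" for step in enabled_steps]


# Example config, shaped like a parsed pipeline_config.yaml (values are made up).
config = {
    "output_dir": "results",
    "demultiplexing": {"enabled": True, "method": "vireo"},
    "doublet_detection": {"enabled": False, "method": "scrublet"},
}

enabled = get_enabled_steps(config)
includes = rule_files_to_include(enabled)
```

Only the demultiplexing rule file is selected here, because doublet_detection is disabled and output_dir is not a step block.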
Pipeline phases
The pipeline executes in phases, each building on the previous:
Phase 1: Cell Ranger count (per-capture)
→ 01_CELLRANGER{GEX|ATAC|ARC}_COUNT/{batch}_{capture}/
Phase 2: Cell Ranger aggregation (per-batch)
→ 02_CELLRANGER{GEX|ATAC|ARC}_AGGR/{batch}/
Phase 3: Per-capture object creation (AnnData/MuData with traceability metadata)
→ 03_ANNDATA/{batch}_{capture}.h5ad|h5mu
Phase 4: Batch aggregation (merge per-capture objects)
→ 04_BATCH_OBJECTS/{batch}_{modality}.h5ad|h5mu
Phase 5-6: Per-capture analysis (demux, doublet — run in parallel)
→ 05_DEMULTIPLEXING/, 06_DOUBLET_DETECTION/
Phase 7: Metadata enrichment (merge analysis results into batch objects)
→ 07_FINAL/{batch}_{modality}.h5ad|h5mu
Metadata enrichment is the final phase. The rules live in batch_aggregation.smk (alongside the aggregation rules) and the logic is in scripts/merge_metadata.py. For each batch, enrichment:
Reads the batch object from 04_BATCH_OBJECTS/
Searches 05_DEMULTIPLEXING/ and 06_DOUBLET_DETECTION/ for per-capture result files
Joins analysis metadata onto adata.obs using cell_id as the key
Writes the enriched object to 07_FINAL/
Enrichment is always the last step. It runs for every enabled modality regardless of which analysis steps are enabled — if no analysis metadata is found, it copies the batch object as-is.
Target generation for enrichment is handled by get_enriched_object_outputs() in build_targets.py, producing done files like {batch}_gex_enrichment.done.
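The join at the heart of enrichment can be illustrated without AnnData. The sketch below is a stand-in under loud assumptions: adata.obs is modelled as a dict keyed by cell_id, per-capture results as lists of row dicts, and both function names are hypothetical. The real merge_metadata.py operates on AnnData/MuData objects and may behave differently, and extending the done-file pattern to modalities other than gex is a guess.

```python
def enrich_obs(obs, per_capture_results):
    """Left-join analysis metadata onto obs using cell_id as the key.

    obs: dict mapping cell_id -> dict of existing columns (stand-in for adata.obs)
    per_capture_results: iterable of row dicts, each with a "cell_id" entry
    """
    enriched = {cell_id: dict(columns) for cell_id, columns in obs.items()}
    for row in per_capture_results:
        cell_id = row["cell_id"]
        if cell_id not in enriched:
            continue  # skip results for cells that are not in the batch object
        for key, value in row.items():
            if key != "cell_id":
                enriched[cell_id][key] = value
    return enriched


def enrichment_done_name(batch, modality):
    """Done-file name modelled on {batch}_gex_enrichment.done (other modalities assumed)."""
    return f"{batch}_{modality}_enrichment.done"


obs = {"AAAC-1": {"capture": "c1"}, "TTTG-1": {"capture": "c2"}}
demux_rows = [{"cell_id": "AAAC-1", "donor": "donor0"}]

enriched = enrich_obs(obs, demux_rows)
unchanged = enrich_obs(obs, [])  # no analysis metadata found: contents copied as-is
```

Cells without a matching result row keep their original columns, which mirrors the copy-as-is behaviour described above.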
Development Workflow
Quick Reference Checklists
Adding a method to an existing preprocessing step (e.g., a new demultiplexing tool):
Add method config schema in schemas/<step>.py with tool_meta
Register the method in the parent config class
Add the rule in workflows/rules/<step>.smk, including a threads: and resources: block (see Resource parameters)
Add target generation in workflows/scripts/build_targets.py
Add to config generator in config_generator.py
Adding a new pipeline step (e.g., a new QC filter or analysis method):
Create the Pydantic schema in schemas/<new_step>.py
Register the output directory in config_validator.py
Register the step in parse_config.py
Add target generation in build_targets.py
Create the rule file in workflows/rules/<new_step>.smk, including a threads: and resources: block (see Resource parameters)
Include the rule file in main.smk
Create a dummy rule and test the DAG
Implement the rule
Add to config generator
Adding a Method to an Existing Preprocessing Step
This example shows how Vireo was added alongside demuxalot for demultiplexing. Use this as a template for adding new methods.
Step 1: Add method config schema
In sc_preprocess/schemas/demultiplexing.py:
from typing import ClassVar

from pydantic import BaseModel, Field

from .base import ToolMeta


class CellSNPConfig(BaseModel):
    """cellsnp-lite parameters for SNP calling."""
    vcf: str = Field(description="Path to VCF reference file with known variants")
    threads: int = Field(default=4, ge=1, description="Number of threads for cellsnp-lite")
    min_maf: float = Field(default=0.0, ge=0.0, le=1.0, description="Minimum minor allele frequency")
    min_count: int = Field(default=1, ge=0, description="Minimum UMI count")

    class Config:
        extra = "forbid"  # reject unknown keys in the config file


class VireoConfig(BaseModel):
    """Vireo demultiplexing parameters (requires cellsnp-lite preprocessing)."""
    tool_meta: ClassVar[ToolMeta] = ToolMeta(
        package="vireoSNP",
        url="https://github.com/single-cell-genetics/vireo",
    )
    cellsnp: CellSNPConfig = Field(description="cellsnp-lite configuration for SNP calling")
    donors: int = Field(description="Number of donors to demultiplex")

    class Config:
        extra = "forbid"  # reject unknown keys in the config file
Every method schema must include a tool_meta class variable. This enables show-params to display the installed tool version and a link to the source. Use ClassVar so Pydantic treats it as a class attribute (not a config field). For tools that aren’t Python packages (e.g., Cell Ranger), set shell_version_cmd:
tool_meta: ClassVar[ToolMeta] = ToolMeta(
    package="vireo",
    url="https://github.com/single-cell-genetics/vireo/releases/tag/v0.2.3",
    shell_version_cmd="cellranger --version",
)
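To make the purpose of tool_meta concrete, here is a minimal sketch of how a version lookup could work for Python packages. Everything in it is an assumption for illustration: the ToolMeta dataclass is a stand-in for the real class in schemas/base.py, installed_version is a hypothetical helper, and the example.org URLs are placeholders. The actual show-params implementation is not shown in this guide.

```python
from dataclasses import dataclass
from importlib import metadata
from typing import Optional


@dataclass(frozen=True)
class ToolMeta:
    """Hypothetical stand-in for ToolMeta in schemas/base.py."""
    package: str
    url: str
    shell_version_cmd: Optional[str] = None


def installed_version(meta):
    """Return the installed version of a Python package, or None if it cannot be found.

    Tools with shell_version_cmd set are not Python packages; this sketch does not
    run the shell command and simply returns None for them.
    """
    if meta.shell_version_cmd is not None:
        return None
    try:
        return metadata.version(meta.package)
    except metadata.PackageNotFoundError:
        return None


# Placeholder metadata objects (names and URLs are illustrative only).
missing = ToolMeta(package="surely-not-an-installed-package-xyz", url="https://example.org")
cli_tool = ToolMeta(
    package="cellranger",
    url="https://example.org",
    shell_version_cmd="cellranger --version",
)
```

Returning None instead of raising keeps a display command usable even when a tool is not installed in the current environment.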
Step 2: Register the method in the parent config class
class DemultiplexingConfig(BaseStepConfig):
    method: Literal["demuxalot", "vireo"] = Field(...)
    vireo: Optional[VireoConfig] = None  # Add this

    @model_validator(mode='after')
    def validate_method_params(self):
        method_configs = {
            "demuxalot": self.demuxalot,
            "vireo": self.vireo,  # Add this
        }
        # ... rest stays the same
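The elided remainder of validate_method_params is not shown in this guide. One plausible shape for the check, written as a plain function so it runs without Pydantic, is sketched below; the error messages and the standalone signature are assumptions rather than the project's code.

```python
def validate_method_params(method, method_configs):
    """Raise ValueError unless the selected method has a config block.

    method_configs maps method name -> config object or None, like the
    method_configs dict built inside the validator above.
    """
    if method not in method_configs:
        raise ValueError(f"Unknown method: {method!r}")
    if method_configs[method] is None:
        raise ValueError(f"method={method!r} requires a '{method}' config block")
    return method


# Selected method has its block: validation passes.
ok = validate_method_params("vireo", {"demuxalot": None, "vireo": {"donors": 4}})

# Selected method is missing its block: validation fails with a clear message.
try:
    validate_method_params("vireo", {"demuxalot": {"x": 1}, "vireo": None})
    error_message = None
except ValueError as exc:
    error_message = str(exc)
```

Inside a Pydantic model_validator(mode='after') method, the same check would read from self and return self at the end.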
Checkpoint: Verify schema registration
After completing Steps 1-2, verify the new method is visible to the CLI:
# Confirm the method appears in the registry
sc-preprocess list-methods
# Confirm schema fields and tool_meta are correct
sc-preprocess show-params --step demultiplexing --method vireo
Step 3: Add the rule
In sc_preprocess/workflows/rules/demultiplexing.smk:
if config.get("demultiplexing") and DEMUX_METHOD == "vireo":
    # Parse vireo config
    VIREO_CONFIG = DEMUX_CONFIG.get("vireo", {})
    CELLSNP_CONFIG = VIREO_CONFIG.get("cellsnp", {})
    VIREO_DONORS = VIREO_CONFIG.get("donors")

    # cellsnp-lite parameters
    CELLSNP_VCF = CELLSNP_CONFIG.get("vcf")
    CELLSNP_THREADS = CELLSNP_CONFIG.get("threads", 4)

    rule cellsnp_lite:
        """Run cellsnp-lite for SNP calling from BAM."""
        input:
            gex_done = os.path.join(config.get("output_dir", "output"), "00_LOGS", "{batch}_{capture}_gex_count.done"),
            bam = os.path.join(GEX_COUNT_DIR, "{batch}_{capture}", "outs", "possorted_genome_bam.bam"),
            barcodes = os.path.join(GEX_COUNT_DIR, "{batch}_{capture}", "outs", "filtered_feature_bc_matrix", "barcodes.tsv.gz")
        output:
            base_vcf = os.path.join(DEMUX_OUTPUT_DIR, "cellsnp_output_{batch}_{capture}", "cellSNP.base.vcf.gz"),
            done = touch(os.path.join(config.get("output_dir", "output"), "00_LOGS", "cellsnp_output_{batch}_{capture}.done"))
        # ...

    rule vireo:
        """Run Vireo for donor deconvolution using cellsnp-lite output."""
        input:
            cellsnp_done = rules.cellsnp_lite.output.done
        output:
            donor_ids = os.path.join(DEMUX_OUTPUT_DIR, "vireo_output_{batch}_{capture}", "donor_ids.tsv"),
            done = touch(os.path.join(OUTPUT_DIRS["logs_dir"], "vireo_output_{batch}_{capture}.done"))
        # ...
Key conventions:
The .done filename pattern must match what build_targets.py generates: {method}_output_{batch}_{capture}.done
.done files always go in {output_dir}/00_LOGS/
Use touch() for .done outputs
Shared variables (output dirs, GEX count dir) go in the top-level if config.get(...) block
Always include threads: and resources: blocks. Without them SLURM defaults to 1 GB and jobs will OOM on real data:
from tempfile import gettempdir  # must be imported explicitly in each .smk file

rule my_rule:
    ...
    threads: config["my_step"].get("threads", 1)
    resources:
        mem_mb = config["my_step"].get("mem_gb", 16) * 1024,  # SLURM uses MB
        runtime = config["my_step"].get("runtime_minutes", 720),  # passed as --time to SLURM
        tmpdir = RESOURCES.get("tmpdir") or gettempdir()
Add threads, mem_gb, and runtime_minutes to the parent config class (not the method sub-config) in schemas/<step>.py:
threads: int = Field(default=1, ge=1, description="Number of threads")
mem_gb: int = Field(default=16, ge=1, description="Memory in GB")
runtime_minutes: int = Field(default=720, gt=0, description="Maximum runtime in minutes for the SLURM job")
Method-specific config goes in the method-level if block
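The arithmetic in the resources block above can be checked outside Snakemake. The following sketch mirrors the same defaults (1 thread, 16 GB, 720 minutes) in a plain function; resolve_resources and its global_resources argument are hypothetical names standing in for the rule block and the RESOURCES dict.

```python
from tempfile import gettempdir


def resolve_resources(step_config, global_resources=None):
    """Compute threads and resource values from a step's config block."""
    global_resources = global_resources or {}
    return {
        "threads": step_config.get("threads", 1),
        "mem_mb": step_config.get("mem_gb", 16) * 1024,  # config is in GB, scheduler wants MB
        "runtime": step_config.get("runtime_minutes", 720),  # minutes
        "tmpdir": global_resources.get("tmpdir") or gettempdir(),
    }


# No overrides: the documented defaults apply.
defaults = resolve_resources({})

# Overrides from the step config and a cluster-specific scratch directory (made-up path).
custom = resolve_resources({"threads": 8, "mem_gb": 64}, {"tmpdir": "/scratch/tmp"})
```

With no overrides this yields 16384 MB and a 720 minute limit; a 64 GB request becomes 65536 MB.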
Step 4: Add target generation
In sc_preprocess/workflows/scripts/build_targets.py:
if method == "vireo":
    for batch in batches:
        for capture in captures:
            outputs.append(os.path.join(logs_dir, f"vireo_output_{batch}_{capture}.done"))
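For reference, the fragment above can be wrapped into a self-contained function that is easy to test in isolation. The function name get_demux_outputs and its signature are assumptions; only the filename pattern is taken from this guide.

```python
import os
from itertools import product


def get_demux_outputs(method, batches, captures, logs_dir):
    """Return the expected .done files for one demultiplexing method (hypothetical helper)."""
    return [
        os.path.join(logs_dir, f"{method}_output_{batch}_{capture}.done")
        for batch, capture in product(batches, captures)
    ]


# One target per batch/capture combination, as in the nested loops above.
outputs = get_demux_outputs("vireo", ["b1", "b2"], ["c1"], "out/00_LOGS")
```

Keeping the pattern in one place makes it easier to compare against the done = touch(...) line of the corresponding rule.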
Checkpoint: Verify rule and targets
After completing Steps 3-4, create a test config that uses the new method and verify the DAG resolves:
# Validate the config parses correctly
sc-preprocess validate-config --config-file your_test_config.yaml
# Dry run - confirm the new rule appears and targets match
sc-preprocess run --config-file your_test_config.yaml --cores 1 --dry-run
# Visual check - confirm rule dependencies look correct
sc-preprocess run --config-file your_test_config.yaml --cores 1 --dag | dot -Tpng > dag.png
If you get a MissingInputException, the .done filename in build_targets.py doesn’t match the rule output. See Common Pitfalls.
Step 5: Add to config generator
Add the new method’s parameters to the relevant modality template(s) in sc_preprocess/config_generator.py so sc-preprocess init-config includes them in the output.
Step 6: Test
See Testing below.
Adding a New Pipeline Step
Follow these steps to add a new preprocessing step (e.g., a quality-control filter or RNA velocity):
Step 1: Create the Pydantic schema
Create sc_preprocess/schemas/<new_step>.py:
"""<Step name> configuration schemas."""
from typing import ClassVar, Literal, Optional

from pydantic import BaseModel, Field, model_validator

from .base import BaseStepConfig, ToolMeta


class MyMethodConfig(BaseModel):
    """Parameters for my method."""
    tool_meta: ClassVar[ToolMeta] = ToolMeta(
        package="my-package",
        url="https://github.com/my/package",
    )
    my_param: str = Field(description="An example parameter")

    class Config:
        extra = "forbid"


class MyStepConfig(BaseStepConfig):
    """My new step configuration."""
    method: Literal["my_method"] = Field(description="Method to use")
    my_method: Optional[MyMethodConfig] = None

    @model_validator(mode='after')
    def validate_method_params(self):
        # Validate that the selected method has its config block
        ...
Checkpoint: Verify schema registration
After Step 1, verify the new step and its methods are visible:
sc-preprocess list-methods
sc-preprocess show-params --step my_step --method my_method
Step 2: Register the output directory
In sc_preprocess/config_validator.py, add to PIPELINE_DIRECTORIES:
("my_step", "0N_MY_STEP"),
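As an illustration of what this registration enables, the sketch below resolves a step's output directory from a list of (step, directory) tuples. The list contents other than the my_step entry, the step keys, and the step_output_dir helper are assumptions; the real PIPELINE_DIRECTORIES in config_validator.py may be structured or consumed differently.

```python
import os

# Hypothetical excerpt; directory names are taken from the phase list, keys are assumed.
PIPELINE_DIRECTORIES = [
    ("demultiplexing", "05_DEMULTIPLEXING"),
    ("doublet_detection", "06_DOUBLET_DETECTION"),
    ("my_step", "0N_MY_STEP"),
]


def step_output_dir(output_dir, step):
    """Return the output directory registered for a step; raises KeyError if unregistered."""
    directories = dict(PIPELINE_DIRECTORIES)
    return os.path.join(output_dir, directories[step])


my_step_dir = step_output_dir("results", "my_step")
```

A step that has not been registered fails fast with a KeyError instead of silently writing to an unexpected location.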
Step 3: Register the step in parse_config.py
Add the step name to the list of steps checked by get_enabled_steps() in workflows/scripts/parse_config.py.
Step 4: Add target generation in build_targets.py
Add a call in build_all_targets() and a get_<step>_outputs() function following the pattern of get_doublet_outputs().
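A minimal sketch of that wiring follows. It assumes a simplified build_all_targets() signature and a done-file pattern borrowed from the demultiplexing convention; get_my_step_outputs is the hypothetical get_<step>_outputs() function for my_step, and the real functions in build_targets.py (including get_doublet_outputs()) are not reproduced here.

```python
import os


def get_my_step_outputs(config, batches, captures, logs_dir):
    """Expected .done files for the hypothetical my_step (filename pattern assumed)."""
    method = config["my_step"]["method"]
    return [
        os.path.join(logs_dir, f"{method}_output_{batch}_{capture}.done")
        for batch in batches
        for capture in captures
    ]


def build_all_targets(config, enabled_steps, batches, captures):
    """Collect targets for rule all; only enabled steps contribute files."""
    logs_dir = os.path.join(config.get("output_dir", "output"), "00_LOGS")
    targets = []
    if "my_step" in enabled_steps:
        targets.extend(get_my_step_outputs(config, batches, captures, logs_dir))
    return targets


config = {"output_dir": "results", "my_step": {"method": "my_method"}}
with_step = build_all_targets(config, ["my_step"], ["b1"], ["c1"])
without_step = build_all_targets(config, [], ["b1"], ["c1"])
```

Gating on the enabled steps keeps rule all from requesting files that no included rule can produce.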
Checkpoint: Verify config validation
sc-preprocess validate-config --config-file your_test_config.yaml
Step 5: Create the rule file
Create sc_preprocess/workflows/rules/<new_step>.smk. Always include threads: and resources: — without them SLURM defaults to 1 GB and jobs will OOM on real data. Import gettempdir explicitly:
from tempfile import gettempdir
Step 6: Include the rule file in main.smk
if "my_step" in ENABLED_STEPS:
    include: "rules/my_step.smk"
Step 7: Create a dummy rule and test the DAG
Before implementing the actual tool logic, write a dummy shell block:
shell:
    """
    echo "Placeholder for my_method"
    touch {output.predictions}
    """
Then verify the DAG resolves correctly:
sc-preprocess run --config-file your_test_config.yaml --cores 1 --dry-run
sc-preprocess run --config-file your_test_config.yaml --cores 1 --dag | dot -Tpng > dag.png
Check that:
Your new rule appears in the DAG
It depends on the correct upstream rules (e.g., cellranger_gex_count)
rule all connects to your new rule’s .done output
No MissingInputException errors
Step 8: Implement the rule
Replace the dummy shell with the actual tool invocation. Use either shell: for command-line tools or run: for Python-based tools.
Step 9: Add to config generator
Add the new step’s parameters to the relevant modality template(s) in sc_preprocess/config_generator.py so sc-preprocess init-config includes them in the output.
Step 10: Write tests
See Testing below.
Step 11: Document
Update this file and the tutorial if the new step is relevant to the standard user workflow.
Testing
Validate your changes before merging.
DAG validation
Always verify the DAG first. This catches target/output mismatches without executing any rules:
# Dry run - checks all inputs/outputs resolve
sc-preprocess run --config-file tests/00_TEST_DATA_GEX/test_config_gex.yaml --cores 1 --dry-run
# Visual DAG - confirm rule dependencies look correct
sc-preprocess run --config-file tests/00_TEST_DATA_GEX/test_config_gex.yaml --cores 1 --dag | dot -Tpng > dag.png
Integration tests
Integration tests run the full pipeline on test data:
# Run integration tests
bash tests/test.sh
# Or run a specific workflow manually
sc-preprocess run --config-file tests/00_TEST_DATA_GEX/test_config_gex.yaml --cores 1
Test checklist for a new step
When adding a new step, verify all of the following before merging:
[ ] Config validation: sc-preprocess validate-config --config-file your_config.yaml succeeds
[ ] Dry run: sc-preprocess run --config-file ... --cores 1 --dry-run shows your rule
[ ] DAG: Your rule appears with correct dependencies in the DAG visualization
[ ] Dummy execution: Pipeline completes with placeholder touch commands
[ ] Real execution: Pipeline completes with the actual tool on test data
[ ] Config generator: sc-preprocess init-config --modality <modality> includes the new step
End-to-end test walkthrough
For a step-by-step walkthrough of running the full pipeline on test data (GEX, ATAC, and ARC), see:
Common Pitfalls
| Problem | Cause | Fix |
|---|---|---|
| MissingInputException | Target filename in build_targets.py doesn’t match the .done output in the rule | Ensure both use the exact same pattern, e.g., {method}_output_{batch}_{capture}.done |
| | Variable defined in one method block but used in another | Move shared variables (output dirs, GEX count dir) to the common if config.get(...) block |
| Pydantic | Schema too strict or missing | Method configs should be |
| Rule not in DAG | Step not added to | Check |
| Config generator skips new step | Step not added to | Add interactive prompts for the new step’s parameters |
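When chasing a target/output mismatch, it can help to compare the two filename lists directly. The helper below is a hypothetical debugging aid, not part of the pipeline: it takes the targets you expect build_targets.py to generate and the done files your rules declare, and reports the differences.

```python
def diff_targets(expected_targets, rule_outputs):
    """Report done files requested by rule all that no rule produces, and vice versa."""
    expected, produced = set(expected_targets), set(rule_outputs)
    return {
        "missing_producer": sorted(expected - produced),  # would trigger MissingInputException
        "unrequested": sorted(produced - expected),  # rule output nobody asks for
    }


# A typical slip: the rule drops the "_output" part of the filename pattern.
report = diff_targets(
    ["00_LOGS/vireo_output_b1_c1.done"],
    ["00_LOGS/vireo_b1_c1.done"],
)
```

Here both lists in the report are non-empty, which points to a naming mismatch rather than a missing rule.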
Building and Editing Documentation
This project uses Sphinx with MyST-Parser for markdown support and is deployed on Read the Docs. Documentation is hosted at: https://sc-preprocess.readthedocs.io/
Documentation structure
docs/
├── source/
│ ├── conf.py # Sphinx configuration
│ ├── index.md # Landing page with toctree
│ ├── installation.md
│ ├── quickstart.md
│ ├── tutorial.md
│ └── development.md # This file
├── requirements.txt # Sphinx dependencies
└── Makefile # Build commands
Building docs locally
To render the documentation locally with live reloading in your web browser:
sc-preprocess render-docs
To build and serve the docs manually instead:
cd docs
# install html rendering software
pip install sphinx
pip install sphinx-copybutton
pip install myst-parser
pip install sphinx-autobuild
# Build HTML
make html
# Serve locally
python3 -m http.server 8000 -d build/html
Then open http://localhost:8000 in your browser.
If you are using VS Code on a remote session, you can render the docs in the IDE itself.
For automatic rebuilds when you save changes, use sphinx-autobuild:
sphinx-autobuild source build/html --port 8000
This watches source/ for changes, rebuilds automatically, and refreshes your browser.
Adding a new page
Create a new .md file in docs/source/
Add it to the toctree in docs/source/index.md:
```{toctree}
:maxdepth: 2
:caption: Contents
installation
quickstart
tutorial
your-new-page
development