Development Guide

This guide walks you through contributing to the sc-preprocess pipeline. Whether you’re adding a new analysis method or an entirely new single-cell preprocessing step, this document provides the workflow, patterns, and testing strategies you need.

Getting Started

Follow the Installation instructions (steps 1-5), using pip install -e . for an editable install.

After pulling new changes from the repository:

git pull
pip install -e .   

Building the documentation

To build and preview the docs locally, install the documentation dependencies:

pip install -e ".[docs]"

Then serve with auto-reload:

sphinx-autobuild docs/source docs/build/html

Updating dependencies

The project uses three types of environment files:

| File | Purpose | Editable? |
| --- | --- | --- |
| environment.yaml | Minimal install spec (hand-maintained) | Yes; edit this when adding or changing dependencies |
| environment_lock_linux.yaml | Exact pins for Linux reproducibility | No; regenerate with conda env export |
| environment_lock_portable.yaml | Exact pins without build hashes | No; regenerate with conda env export --no-builds |

To add a new dependency:

Do NOT run conda env export > environment.yaml — this overwrites the minimal spec with a 240+ line platform-specific dump.

  1. Add it to environment.yaml (conda section or pip section)

  2. Recreate your environment: conda env remove -n snakemake8 && conda env create -f environment.yaml

  3. Optionally regenerate lock files:

    conda env export | grep -v "^prefix:" > environment_lock_linux.yaml
    conda env export --no-builds | grep -v "^prefix:" > environment_lock_portable.yaml
    

Per-rule conda environments: If you add a tool with dependency conflicts that cannot be resolved in the main environment, Snakemake supports isolated per-rule conda environments via the conda: directive. Place a .yaml file in workflows/envs/ and reference it in the rule:

rule my_rule:
    ...
    conda:
        "../envs/my_tool.yaml"
    script:
        "../scripts/my_script.py"

Note for RHEL8/HPC users: Snakemake’s conda: directive activates isolated environments in a subprocess that does not inherit LD_LIBRARY_PATH. On systems where the system libstdc++ is older than GLIBCXX_3.4.29 (e.g., RHEL8), this causes import failures for any modern conda-forge package. In that case, add the tool to environment.yaml instead.

Key Concepts

Preprocessing Step: A stage in single-cell preprocessing, such as demultiplexing or doublet detection. Each step has its own Snakemake rule file (workflows/rules/<step>.smk) and Pydantic schema (schemas/<step>.py), so that several tools solving the same step (e.g., cellsnp-lite + vireo, or demuxalot) can be supported side by side.

Method: A software tool that implements a preprocessing step. For example, the demultiplexing step supports two methods: vireo and demuxalot. Adding a new method means integrating another tool into an existing preprocessing step.

Developer Workflow

Follow these steps when making changes:

  1. Orient - Understand the project architecture and how the pipeline resolves what to run

  2. Develop - Choose your task: add a method or add a step

  3. Test - Validate with DAG checks and integration tests

  4. Document - Update documentation!


Project Architecture

Understanding the file layout is essential before making changes:

sc_preprocess/   
├── cli.py                          # CLI entry points
├── config_generator.py             # Interactive config builder
├── config_validator.py             # PIPELINE_DIRECTORIES, validation
├── schemas/                        # Pydantic models for config validation
│   ├── base.py                     # BaseStepConfig (all steps inherit this)
│   ├── config.py                   # PipelineConfig (unified schema)
│   ├── cellranger.py               # GEX/ATAC/ARC Cell Ranger configs
│   ├── demultiplexing.py           # DemuxalotConfig, VireoConfig
│   └── doublet_detection.py        # ScrubletConfig
├── workflows/
│   ├── main.smk                    # Master workflow, rule all, includes
│   ├── rules/                      # One .smk file per pipeline step
│   │   ├── cellranger.smk          # Cell Ranger count/aggregation
│   │   ├── object_creation.smk     # Per-capture AnnData/MuData creation
│   │   ├── batch_aggregation.smk   # Batch-level aggregation + metadata enrichment
│   │   ├── demultiplexing.smk      # Demuxalot/Vireo
│   │   └── doublet_detection.smk   # Scrublet
│   └── scripts/
│       ├── build_targets.py        # Generates target files for rule all
│       ├── parse_config.py         # Extracts enabled steps, methods, etc.
│       ├── create_gex_anndata.py   # GEX per-capture object creation
│       ├── create_atac_anndata.py  # ATAC per-capture object creation
│       ├── create_arc_mudata.py    # ARC per-capture MuData creation
│       ├── aggregate_batch.py      # Batch aggregation (per-capture → batch)
│       ├── merge_metadata.py       # Merge analysis metadata into batch objects
│       └── run_scrublet.py         # Scrublet doublet detection
tests/
├── test.sh                         # Integration test script
├── 00_TEST_DATA_GEX/               # Test configs and library lists
└── ...
docs/source/                        # Read the Docs documentation (this file)

How the pipeline resolves what to run

  1. Config (pipeline_config.yaml) declares which steps are enabled: true

  2. get_enabled_steps() in parse_config.py reads the config and returns a list of enabled step names

  3. main.smk conditionally includes .smk rule files based on enabled steps

  4. build_all_targets() in build_targets.py generates the list of expected .done files for rule all

  5. Each rule produces a .done marker file in {output_dir}/00_LOGS/ that matches what build_targets.py expects

If the target filename from build_targets.py doesn’t match the .done output declared in the rule, Snakemake will raise a MissingInputException.
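
The first two steps of this chain can be sketched in plain Python. This is a minimal illustration, not the project's code: only the function name get_enabled_steps comes from this guide, while the config layout (one top-level mapping per step with an enabled flag) and the function body are assumptions.

```python
from typing import Any, Dict, List


def get_enabled_steps(config: Dict[str, Any]) -> List[str]:
    """Return the names of step blocks that declare `enabled: true`.

    Assumes each step is a top-level mapping with an `enabled` key;
    the real parse_config.py may be organised differently.
    """
    return [
        name
        for name, block in config.items()
        if isinstance(block, dict) and block.get("enabled") is True
    ]


# Hypothetical parsed pipeline_config.yaml
config = {
    "output_dir": "results",
    "demultiplexing": {"enabled": True, "method": "vireo"},
    "doublet_detection": {"enabled": False, "method": "scrublet"},
}

print(get_enabled_steps(config))  # ['demultiplexing']
```

Under these assumptions, main.smk would include only demultiplexing.smk, and only demultiplexing targets would be requested by rule all.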

Pipeline phases

The pipeline executes in phases, each building on the previous:

Phase 1: Cell Ranger count (per-capture)
  → 01_CELLRANGER{GEX|ATAC|ARC}_COUNT/{batch}_{capture}/

Phase 2: Cell Ranger aggregation (per-batch)
  → 02_CELLRANGER{GEX|ATAC|ARC}_AGGR/{batch}/

Phase 3: Per-capture object creation (AnnData/MuData with traceability metadata)
  → 03_ANNDATA/{batch}_{capture}.h5ad|h5mu

Phase 4: Batch aggregation (merge per-capture objects)
  → 04_BATCH_OBJECTS/{batch}_{modality}.h5ad|h5mu

Phase 5-6: Per-capture analysis (demux, doublet — run in parallel)
  → 05_DEMULTIPLEXING/, 06_DOUBLET_DETECTION/

Phase 7: Metadata enrichment (merge analysis results into batch objects)
  → 07_FINAL/{batch}_{modality}.h5ad|h5mu

Metadata enrichment is the final phase. The rules live in batch_aggregation.smk (alongside the aggregation rules) and the logic is in scripts/merge_metadata.py. For each batch, enrichment:

  1. Reads the batch object from 04_BATCH_OBJECTS/

  2. Searches 05_DEMULTIPLEXING/ and 06_DOUBLET_DETECTION/ for per-capture result files

  3. Joins analysis metadata onto adata.obs using cell_id as the key

  4. Writes the enriched object to 07_FINAL/

Enrichment is always the last step. It runs for every enabled modality regardless of which analysis steps are enabled — if no analysis metadata is found, it copies the batch object as-is.

Target generation for enrichment is handled by get_enriched_object_outputs() in build_targets.py, producing done files like {batch}_gex_enrichment.done.
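
The join in step 3 behaves like a left join on cell_id. The sketch below shows that behaviour with plain dictionaries so it runs without AnnData or pandas; the function name enrich_obs, the donor_id column, and the cell_id values are hypothetical, and merge_metadata.py presumably operates on adata.obs directly.

```python
from typing import Dict, List

Row = Dict[str, object]


def enrich_obs(obs: List[Row], analysis: List[Row], key: str = "cell_id") -> List[Row]:
    """Left-join analysis rows onto obs rows by `key`.

    Cells without a match keep their original columns, so an empty
    `analysis` list returns a copy of obs unchanged.
    """
    lookup = {row[key]: row for row in analysis}
    enriched = []
    for row in obs:
        merged = dict(row)  # copy so the input rows are not mutated
        extra = lookup.get(row[key], {})
        merged.update({k: v for k, v in extra.items() if k != key})
        enriched.append(merged)
    return enriched


obs = [{"cell_id": "b1_c1_AAAC"}, {"cell_id": "b1_c1_AAAG"}]
demux = [{"cell_id": "b1_c1_AAAC", "donor_id": "donor0"}]

print(enrich_obs(obs, demux))
# [{'cell_id': 'b1_c1_AAAC', 'donor_id': 'donor0'}, {'cell_id': 'b1_c1_AAAG'}]
```

Passing an empty analysis list reproduces the "copy the batch object as-is" case described above.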


Development Workflow

Quick Reference Checklists

Adding a method to an existing preprocessing step (e.g., a new demultiplexing tool):

  1. Add method config schema in schemas/<step>.py with tool_meta

  2. Register the method in the parent config class

  3. Add the rule in workflows/rules/<step>.smkinclude a threads: and resources: block (see Resource parameters)

  4. Add target generation in workflows/scripts/build_targets.py

  5. Add to config generator in config_generator.py

  6. Test

  7. Document

Adding a new pipeline step (e.g., a new QC filter or analysis method):

  1. Create the Pydantic schema in schemas/<new_step>.py

  2. Register the output directory in config_validator.py

  3. Register the step in parse_config.py

  4. Add target generation in build_targets.py

  5. Create the rule file in workflows/rules/<new_step>.smkinclude a threads: and resources: block (see Resource parameters)

  6. Include the rule file in main.smk

  7. Create a dummy rule and test the DAG

  8. Implement the rule

  9. Add to config generator

  10. Write tests

  11. Document


Adding a Method to an Existing Preprocessing Step

This example shows how Vireo was added alongside demuxalot for demultiplexing. Use this as a template for adding new methods.

Step 1: Add method config schema

In sc_preprocess/schemas/demultiplexing.py:

from typing import ClassVar
from pydantic import BaseModel, Field
from .base import ToolMeta

class CellSNPConfig(BaseModel):
    """cellsnp-lite parameters for SNP calling."""

    vcf: str = Field(description="Path to VCF reference file with known variants")
    threads: int = Field(default=4, ge=1, description="Number of threads for cellsnp-lite")
    min_maf: float = Field(default=0.0, ge=0.0, le=1.0, description="Minimum minor allele frequency")
    min_count: int = Field(default=1, ge=0, description="Minimum UMI count")

    class Config:
        extra = "forbid"


class VireoConfig(BaseModel):
    """Vireo demultiplexing parameters (requires cellsnp-lite preprocessing)."""

    tool_meta: ClassVar[ToolMeta] = ToolMeta(
        package="vireoSNP",
        url="https://github.com/single-cell-genetics/vireo",
    )

    cellsnp: CellSNPConfig = Field(description="cellsnp-lite configuration for SNP calling")
    donors: int = Field(description="Number of donors to demultiplex")

    class Config:
        extra = "forbid"

Every method schema must include a tool_meta class variable. This enables show-params to display the installed tool version and a link to the source. Use ClassVar so Pydantic treats it as a class attribute (not a config field). For tools that aren’t Python packages (e.g., Cell Ranger), set shell_version_cmd:

tool_meta: ClassVar[ToolMeta] = ToolMeta(
    package="cellranger",
    url="https://www.10xgenomics.com/support/software/cell-ranger",
    shell_version_cmd="cellranger --version",
)
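
To make the role of the two fields concrete, here is a rough sketch of how a version lookup could use them. It is an assumption about how show-params might work, not the project's implementation: the ToolMeta dataclass below is a stand-in for the class in schemas/base.py, and installed_version is a hypothetical helper.

```python
import shlex
import subprocess
from dataclasses import dataclass
from importlib import metadata
from typing import Optional


@dataclass(frozen=True)
class ToolMeta:
    """Stand-in for the project's ToolMeta (fields assumed from this guide)."""

    package: str
    url: str
    shell_version_cmd: Optional[str] = None


def installed_version(meta: ToolMeta) -> Optional[str]:
    """Return the tool version, or None if the tool cannot be found."""
    if meta.shell_version_cmd:
        # Non-Python tool: ask the executable itself.
        try:
            result = subprocess.run(
                shlex.split(meta.shell_version_cmd),
                capture_output=True,
                text=True,
                check=False,
            )
        except OSError:  # executable missing or not runnable
            return None
        return (result.stdout or result.stderr).strip() or None
    # Python package: read the installed distribution metadata.
    try:
        return metadata.version(meta.package)
    except metadata.PackageNotFoundError:
        return None


print(installed_version(ToolMeta(package="vireoSNP", url="https://github.com/single-cell-genetics/vireo")))
```

The printed value depends on the environment: a version string if vireoSNP is installed, otherwise None.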

Step 2: Register the method in the parent config class

class DemultiplexingConfig(BaseStepConfig):
    method: Literal["demuxalot", "vireo"] = Field(...)

    vireo: Optional[VireoConfig] = None  # Add this

    @model_validator(mode='after')
    def validate_method_params(self):
        method_configs = {
            "demuxalot": self.demuxalot,
            "vireo": self.vireo,  # Add this
        }
        # ... rest stays the same
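
The elided part of the validator is not shown in this guide. As a guess at its intent, the check below verifies that the selected method has a config block; check_method_block is a hypothetical standalone function written without Pydantic so it can be run in isolation.

```python
from typing import Any, Dict, Optional


def check_method_block(method: str, method_configs: Dict[str, Optional[Any]]) -> None:
    """Raise ValueError if the selected method has no config block."""
    if method not in method_configs:
        raise ValueError(
            f"Unknown method '{method}'; expected one of {sorted(method_configs)}"
        )
    if method_configs[method] is None:
        raise ValueError(f"method is '{method}' but no '{method}:' block was provided")


# Selected method has its block: passes silently.
check_method_block("vireo", {"demuxalot": None, "vireo": {"donors": 4}})

# Selected method is missing its block: rejected.
try:
    check_method_block("vireo", {"demuxalot": {"some": "params"}, "vireo": None})
except ValueError as err:
    print(err)
```

Inside a Pydantic model_validator(mode='after'), the same check would run against self.method and the method_configs dictionary, then return self.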

Checkpoint: Verify schema registration

After completing Steps 1-2, verify the new method is visible to the CLI:

# Confirm the method appears in the registry
sc-preprocess list-methods

# Confirm schema fields and tool_meta are correct
sc-preprocess show-params --step demultiplexing --method vireo

Step 3: Add the rule

In sc_preprocess/workflows/rules/demultiplexing.smk:

if config.get("demultiplexing") and DEMUX_METHOD == "vireo":

    # Parse vireo config
    VIREO_CONFIG = DEMUX_CONFIG.get("vireo", {})
    CELLSNP_CONFIG = VIREO_CONFIG.get("cellsnp", {})
    VIREO_DONORS = VIREO_CONFIG.get("donors")

    # cellsnp-lite parameters
    CELLSNP_VCF = CELLSNP_CONFIG.get("vcf")
    CELLSNP_THREADS = CELLSNP_CONFIG.get("threads", 4)

    rule cellsnp_lite:
        """Run cellsnp-lite for SNP calling from BAM."""
        input:
            gex_done = os.path.join(config.get("output_dir", "output"), "00_LOGS", "{batch}_{capture}_gex_count.done"),
            bam = os.path.join(GEX_COUNT_DIR, "{batch}_{capture}", "outs", "possorted_genome_bam.bam"),
            barcodes = os.path.join(GEX_COUNT_DIR, "{batch}_{capture}", "outs", "filtered_feature_bc_matrix", "barcodes.tsv.gz")
        output:
            base_vcf = os.path.join(DEMUX_OUTPUT_DIR, "cellsnp_output_{batch}_{capture}", "cellSNP.base.vcf.gz"),
            done = touch(os.path.join(config.get("output_dir", "output"), "00_LOGS", "cellsnp_output_{batch}_{capture}.done"))
        # ...

    rule vireo:
        """Run Vireo for donor deconvolution using cellsnp-lite output."""
        input:
            cellsnp_done = rules.cellsnp_lite.output.done
        output:
            donor_ids = os.path.join(DEMUX_OUTPUT_DIR, "vireo_output_{batch}_{capture}", "donor_ids.tsv"),
            done = touch(os.path.join(OUTPUT_DIRS["logs_dir"], "vireo_output_{batch}_{capture}.done"))
        # ...

Key conventions:

  • The .done filename pattern must match what build_targets.py generates: {method}_output_{batch}_{capture}.done

  • .done files always go in {output_dir}/00_LOGS/

  • Use touch() for .done outputs

  • Shared variables (output dirs, GEX count dir) go in the top-level if config.get(...) block

  • Always include threads: and resources: blocks — without them SLURM defaults to 1 GB and jobs will OOM on real data:

from tempfile import gettempdir  # must be imported explicitly in each .smk file

rule my_rule:
    ...
    threads: config["my_step"].get("threads", 1)
    resources:
        mem_mb = config["my_step"].get("mem_gb", 16) * 1024,  # SLURM uses MB
        runtime = config["my_step"].get("runtime_minutes", 720),  # passed as --time to SLURM
        tmpdir = RESOURCES.get("tmpdir") or gettempdir()

Add threads, mem_gb, and runtime_minutes to the parent config class (not the method sub-config) in schemas/<step>.py:

threads: int = Field(default=1, ge=1, description="Number of threads")
mem_gb: int = Field(default=16, ge=1, description="Memory in GB")
runtime_minutes: int = Field(default=720, gt=0, description="Maximum runtime in minutes for the SLURM job")

  • Method-specific config goes in the method-level if block
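
The unit conversion and defaults used in the resources block can be checked outside Snakemake. The helper below is a hypothetical sketch (rule_resources is not part of the pipeline) that mirrors the expressions in the rule above.

```python
import tempfile
from typing import Any, Dict, Optional


def rule_resources(step_cfg: Dict[str, Any], tmpdir: Optional[str] = None) -> Dict[str, Any]:
    """Mirror the threads/resources expressions from the rule template."""
    return {
        "threads": step_cfg.get("threads", 1),
        "mem_mb": step_cfg.get("mem_gb", 16) * 1024,  # config is in GB, SLURM wants MB
        "runtime": step_cfg.get("runtime_minutes", 720),
        "tmpdir": tmpdir or tempfile.gettempdir(),  # fall back to the system temp dir
    }


print(rule_resources({"mem_gb": 32}, tmpdir="/scratch/tmp"))
# {'threads': 1, 'mem_mb': 32768, 'runtime': 720, 'tmpdir': '/scratch/tmp'}
```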

Step 4: Add target generation

In sc_preprocess/workflows/scripts/build_targets.py:

if method == "vireo":
    for batch in batches:
        for capture in captures:
            outputs.append(os.path.join(logs_dir, f"vireo_output_{batch}_{capture}.done"))
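
A self-contained version of the same loop is useful for checking the generated names by hand. The function name get_method_outputs and its signature are assumptions for illustration; the filename pattern is the one required by the conventions above.

```python
import os
from itertools import product
from typing import Iterable, List


def get_method_outputs(
    method: str, logs_dir: str, batches: Iterable[str], captures: Iterable[str]
) -> List[str]:
    """Build one .done target per (batch, capture) pair for a method."""
    return [
        os.path.join(logs_dir, f"{method}_output_{batch}_{capture}.done")
        for batch, capture in product(batches, captures)
    ]


for target in get_method_outputs("vireo", "results/00_LOGS", ["batch1"], ["cap1", "cap2"]):
    print(target)
```

Each printed path should match, character for character, the done output of the vireo rule once its wildcards are filled in.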

Checkpoint: Verify rule and targets

After completing Steps 3-4, create a test config that uses the new method and verify the DAG resolves:

# Validate the config parses correctly
sc-preprocess validate-config --config-file your_test_config.yaml

# Dry run - confirm the new rule appears and targets match
sc-preprocess run --config-file your_test_config.yaml --cores 1 --dry-run

# Visual check - confirm rule dependencies look correct
sc-preprocess run --config-file your_test_config.yaml --cores 1 --dag | dot -Tpng > dag.png

If you get a MissingInputException, the .done filename in build_targets.py doesn’t match the rule output. See Common Pitfalls.
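
When the mismatch is hard to spot by eye, a small script can compare the generated target names against the rule's output pattern. This is a hypothetical debugging aid, not part of the pipeline; it treats every {wildcard} as matching one or more characters, which is Snakemake's default for unconstrained wildcards.

```python
import re
from typing import Iterable, List


def wildcard_pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    """Turn 'vireo_output_{batch}_{capture}.done' into a compiled regex."""
    parts = re.split(r"(\{\w+\})", pattern)  # keep the {wildcard} tokens
    regex = "".join(
        f"(?P<{part[1:-1]}>.+)" if part.startswith("{") else re.escape(part)
        for part in parts
    )
    return re.compile(f"^{regex}$")


def unmatched_targets(targets: Iterable[str], rule_output_pattern: str) -> List[str]:
    """Return the targets that the rule's output pattern cannot produce."""
    rx = wildcard_pattern_to_regex(rule_output_pattern)
    return [t for t in targets if not rx.match(t)]


targets = ["vireo_output_batch1_cap1.done", "vireo_batch1_cap2.done"]
print(unmatched_targets(targets, "vireo_output_{batch}_{capture}.done"))
# ['vireo_batch1_cap2.done']
```

Any name in the returned list is a target that no rule can create, which is the condition Snakemake reports as MissingInputException.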

Step 5: Add to config generator

Add the new method’s parameters to the relevant modality template(s) in sc_preprocess/config_generator.py so sc-preprocess init-config includes them in the output.

Step 6: Test

See Testing below.


Adding a New Pipeline Step

Follow these steps to add a new preprocessing step (e.g., a quality-control filter or RNA velocity):

Step 1: Create the Pydantic schema

Create sc_preprocess/schemas/<new_step>.py:

"""<Step name> configuration schemas."""

from typing import ClassVar, Literal, Optional
from pydantic import BaseModel, Field, model_validator
from .base import BaseStepConfig, ToolMeta


class MyMethodConfig(BaseModel):
    """Parameters for my method."""

    tool_meta: ClassVar[ToolMeta] = ToolMeta(
        package="my-package",
        url="https://github.com/my/package",
    )

    my_param: str = Field(description="An example parameter")

    class Config:
        extra = "forbid"


class MyStepConfig(BaseStepConfig):
    """My new step configuration."""

    method: Literal["my_method"] = Field(description="Method to use")
    my_method: Optional[MyMethodConfig] = None

    @model_validator(mode='after')
    def validate_method_params(self):
        # Validate that the selected method has its config block
        ...

Checkpoint: Verify schema registration

After Step 1, verify the new step and its methods are visible:

sc-preprocess list-methods
sc-preprocess show-params --step my_step --method my_method

Step 2: Register the output directory

In sc_preprocess/config_validator.py, add to PIPELINE_DIRECTORIES:

("my_step", "0N_MY_STEP"),
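
For orientation, the sketch below shows how a list of (key, directory) pairs like this can be turned into absolute output paths. The entries other than my_step, the key names, and the resolve_output_dirs helper are all assumptions; check config_validator.py for the real list and how it is consumed.

```python
import os
from typing import Dict, List, Tuple

# Hypothetical excerpt; the real list lives in config_validator.py.
PIPELINE_DIRECTORIES: List[Tuple[str, str]] = [
    ("logs_dir", "00_LOGS"),
    ("demultiplexing", "05_DEMULTIPLEXING"),
    ("doublet_detection", "06_DOUBLET_DETECTION"),
    ("my_step", "0N_MY_STEP"),  # the new entry
]


def resolve_output_dirs(output_dir: str) -> Dict[str, str]:
    """Map each directory key to its path under the pipeline output dir."""
    return {key: os.path.join(output_dir, subdir) for key, subdir in PIPELINE_DIRECTORIES}


dirs = resolve_output_dirs("results")
print(dirs["my_step"])
```

Replace 0N with the next free phase number so the directory sorts in execution order.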

Step 3: Register the step in parse_config.py

Add the step name to the list returned by get_enabled_steps() in workflows/scripts/parse_config.py.

Step 4: Add target generation in build_targets.py

Add a call in build_all_targets() and a get_<step>_outputs() function following the pattern of get_doublet_outputs().
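
The shape of that change might look like the following. Everything here is a hedged sketch: the signatures, the STEP_OUTPUTS registry, and the my_method filename are assumptions, since the real build_all_targets() and get_doublet_outputs() are not reproduced in this guide.

```python
import os
from typing import Callable, Dict, List


def get_my_step_outputs(logs_dir: str, batches: List[str], captures: List[str]) -> List[str]:
    """Hypothetical per-step target builder: one .done file per capture."""
    return [
        os.path.join(logs_dir, f"my_method_output_{batch}_{capture}.done")
        for batch in batches
        for capture in captures
    ]


# Hypothetical registry mapping step names to their target builders.
STEP_OUTPUTS: Dict[str, Callable[[str, List[str], List[str]], List[str]]] = {
    "my_step": get_my_step_outputs,
}


def build_all_targets(
    enabled_steps: List[str], logs_dir: str, batches: List[str], captures: List[str]
) -> List[str]:
    """Collect the .done targets of every enabled step for rule all."""
    targets: List[str] = []
    for step in enabled_steps:
        get_outputs = STEP_OUTPUTS.get(step)
        if get_outputs is not None:
            targets.extend(get_outputs(logs_dir, batches, captures))
    return targets


print(build_all_targets(["my_step"], "results/00_LOGS", ["batch1"], ["cap1"]))
```

A step that is enabled but has no builder contributes no targets in this sketch, which is the same symptom as "Rule not in DAG" in Common Pitfalls.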

Checkpoint: Verify config validation

sc-preprocess validate-config --config-file your_test_config.yaml

Step 5: Create the rule file

Create sc_preprocess/workflows/rules/<new_step>.smk. Always include threads: and resources: — without them SLURM defaults to 1 GB and jobs will OOM on real data. Import gettempdir explicitly:

from tempfile import gettempdir

Step 6: Include the rule file in main.smk

if "my_step" in ENABLED_STEPS:
    include: "rules/my_step.smk"

Step 7: Create a dummy rule and test the DAG

Before implementing the actual tool logic, write a dummy shell block:

shell:
    """
    echo "Placeholder for my_method"
    touch {output.predictions}
    """

Then verify the DAG resolves correctly:

sc-preprocess run --config-file your_test_config.yaml --cores 1 --dry-run
sc-preprocess run --config-file your_test_config.yaml --cores 1 --dag | dot -Tpng > dag.png

Check that:

  • Your new rule appears in the DAG

  • It depends on the correct upstream rules (e.g., cellranger_gex_count)

  • rule all connects to your new rule’s .done output

  • No MissingInputException errors

Step 8: Implement the rule

Replace the dummy shell with the actual tool invocation. Use either shell: for command-line tools or run: for Python-based tools.

Step 9: Add to config generator

Add the new step’s parameters to the relevant modality template(s) in sc_preprocess/config_generator.py so sc-preprocess init-config includes them in the output.

Step 10: Write tests

See Testing below.

Step 11: Document

Update this file and the tutorial if the new step is relevant to the standard user workflow.


Testing

Validate your changes before merging.

DAG validation

Always verify the DAG first. This catches target/output mismatches without executing any rules:

# Dry run - checks all inputs/outputs resolve
sc-preprocess run --config-file tests/00_TEST_DATA_GEX/test_config_gex.yaml --cores 1 --dry-run

# Visual DAG - confirm rule dependencies look correct
sc-preprocess run --config-file tests/00_TEST_DATA_GEX/test_config_gex.yaml --cores 1 --dag | dot -Tpng > dag.png

Integration tests

Integration tests run the full pipeline on test data:

# Run integration tests
bash tests/test.sh

# Or run a specific workflow manually
sc-preprocess run --config-file tests/00_TEST_DATA_GEX/test_config_gex.yaml --cores 1

Test checklist for a new step

When adding a new step, verify all of the following before merging:

  • [ ] Config validation: sc-preprocess validate-config --config-file your_config.yaml succeeds

  • [ ] Dry run: sc-preprocess run --config-file ... --cores 1 --dry-run shows your rule

  • [ ] DAG: Your rule appears with correct dependencies in the DAG visualization

  • [ ] Dummy execution: Pipeline completes with placeholder touch commands

  • [ ] Real execution: Pipeline completes with the actual tool on test data

  • [ ] Config generator: sc-preprocess init-config --modality <modality> includes the new step

End-to-end test walkthrough

For a step-by-step walkthrough of running the full pipeline on test data (GEX, ATAC, and ARC), see:


Common Pitfalls

| Problem | Cause | Fix |
| --- | --- | --- |
| MissingInputException for .done files | Target filename in build_targets.py doesn’t match the rule’s output.done | Ensure both use the exact same pattern, e.g., {method}_output_{batch}_{capture}.done |
| NameError for shared variables | Variable defined in one method block but used in another | Move shared variables (output dirs, GEX count dir) to the common if config.get("step"): block |
| Pydantic ValidationError on valid config | Schema too strict or missing Optional on method-specific config | Method configs should be Optional[...] = None with a model_validator to enforce presence based on method |
| Rule not in DAG | Step not added to get_enabled_steps() or main.smk includes | Check parse_config.py and main.smk both reference your step name |
| Config generator skips new step | Step not added to config_generator.py | Add interactive prompts for the new step’s parameters |


Building and Editing Documentation

This project uses Sphinx with MyST-Parser for markdown support and is deployed on Read the Docs. Documentation is hosted at: https://sc-preprocess.readthedocs.io/

Documentation structure

docs/
├── source/
│   ├── conf.py           # Sphinx configuration
│   ├── index.md          # Landing page with toctree
│   ├── installation.md
│   ├── quickstart.md
│   ├── tutorial.md
│   └── development.md    # This file
├── requirements.txt      # Sphinx dependencies
└── Makefile              # Build commands

Building docs locally

To render the documentation locally with live reloading in your web browser:

sc-preprocess render-docs

To build and serve the docs manually instead:

cd docs

# install html rendering software
pip install sphinx
pip install sphinx-copybutton
pip install myst-parser
pip install sphinx-autobuild

# Build HTML
make html

# Serve locally
python3 -m http.server 8000 -d build/html

Then open http://localhost:8000 in your browser.

If you are using VS Code on a remote session, you can render the docs in the IDE itself.

For automatic rebuilds when you save changes, use sphinx-autobuild:

sphinx-autobuild source build/html --port 8000

This watches source/ for changes, rebuilds automatically, and refreshes your browser.

Adding a new page

  1. Create a new .md file in docs/source/

  2. Add it to the toctree in docs/source/index.md:

```{toctree}
:maxdepth: 2
:caption: Contents

installation
quickstart
tutorial
your-new-page
development