Batch Correction
Batch correction removes technical variation (batch effects) between different experiments, acquisition runs, or files, allowing data from multiple sources to be combined for analysis.
Overview
When combining IMC data from multiple experiments, files, or acquisition runs, technical variation can introduce artifacts that confound biological signals. Batch correction methods identify and remove these technical effects while preserving biological variation.
Options
OpenIMC supports two batch correction methods:
- Harmony (default): Iterative clustering-based correction in PCA space
- ComBat: Empirical Bayes method for batch effect correction
Parameters
Common Parameters
method (default: "harmony"): Batch correction method
- Options: "harmony", "combat"
batch_variable (optional): Column name containing batch identifiers
- If not specified, auto-detects from the source_file or acquisition_id columns
- Each unique value represents a batch
features (optional): List of feature column names to correct
- If not specified, auto-detects all numeric feature columns
- Excludes metadata columns (cell_id, centroid_x, centroid_y, cluster, etc.)
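The auto-detection rules above can be sketched in a few lines of Python. This is a minimal illustration, not OpenIMC's actual implementation: the helper names, the exact metadata column set, and the marker column names (CD3_mean, CD8_mean) are assumptions made for the example.

```python
# Hypothetical helpers; OpenIMC's real detection logic may differ.
BATCH_CANDIDATES = ("source_file", "acquisition_id")
# Assumed metadata set, based on the columns named in the documentation.
METADATA_COLUMNS = {"cell_id", "centroid_x", "centroid_y", "cluster",
                    "source_file", "acquisition_id"}


def _is_number(value):
    """Return True if value parses as a float."""
    try:
        float(value)
    except (TypeError, ValueError):
        return False
    return True


def detect_batch_variable(rows, batch_variable=None):
    """Pick the batch column: explicit choice first, then known candidates."""
    columns = rows[0].keys()
    if batch_variable is not None:
        if batch_variable not in columns:
            raise KeyError(f"batch column {batch_variable!r} not found")
        return batch_variable
    for candidate in BATCH_CANDIDATES:
        if candidate in columns:
            return candidate
    raise ValueError("no batch column found; pass batch_variable explicitly")


def detect_features(rows, features=None):
    """Pick feature columns: explicit list, else all numeric non-metadata columns."""
    if features is not None:
        return list(features)
    return [
        col for col in rows[0]
        if col not in METADATA_COLUMNS and all(_is_number(r[col]) for r in rows)
    ]


# Two made-up cells, as they might be read by csv.DictReader.
rows = [
    {"cell_id": "1", "source_file": "a.mcd", "centroid_x": "10.5",
     "CD3_mean": "1.2", "CD8_mean": "0.4"},
    {"cell_id": "2", "source_file": "b.mcd", "centroid_x": "3.0",
     "CD3_mean": "0.9", "CD8_mean": "2.1"},
]
batch_col = detect_batch_variable(rows)
feature_cols = detect_features(rows)
n_batches = len({r[batch_col] for r in rows})
print(batch_col, feature_cols, n_batches)
```

Checking the number of unique batch values before running a correction is a cheap way to catch a wrongly detected batch column.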
Harmony Parameters
n_clusters (default: 30): Number of Harmony clusters
- More clusters capture finer structure but may overfit
- Typical range: 10-50
- Increase for datasets with many cell types
sigma (default: 0.1): Width of soft k-means clusters
- Controls how “soft” the cluster assignments are
- Lower values (0.05-0.1) create sharper clusters
- Higher values (0.1-0.3) create softer, overlapping clusters
theta (default: 2.0): Diversity clustering penalty parameter
- Higher values (2.0-4.0) encourage more diverse clusters across batches
- Lower values (0.5-2.0) allow more batch-specific clusters
- Adjust if batches are not mixing well
lambda_reg (default: 1.0): Regularization parameter
- Controls strength of correction
- Higher values (1.0-2.0) apply stronger correction
- Lower values (0.5-1.0) apply weaker correction
- Increase if batch effects are strong
max_iter (default: 20): Maximum number of iterations
- Harmony typically converges in 10-20 iterations
- Increase (30-50) if convergence is slow
- Decrease if processing time is a concern
pca_variance (default: 0.9): Proportion of variance to retain in PCA
- Higher values (0.9-0.95) retain more information but may include noise
- Lower values (0.8-0.9) reduce dimensionality more aggressively
- Typical range: 0.8-0.95
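To build intuition for sigma and pca_variance, the sketch below computes soft cluster memberships for a single cell at two sigma values, and counts how many principal components a given variance target would keep. It is a simplified illustration: the function names and input numbers are invented, and Harmony's real assignment step also includes the diversity penalty (theta), which is left out here.

```python
import math


def soft_assignments(sq_distances, sigma):
    """Softmax over -distance/sigma: one cell's membership in each cluster."""
    # Shift by the minimum distance for numerical stability.
    d_min = min(sq_distances)
    weights = [math.exp(-(d - d_min) / sigma) for d in sq_distances]
    total = sum(weights)
    return [w / total for w in weights]


def n_components_for_variance(explained_variance_ratio, pca_variance):
    """Smallest number of leading components whose cumulative variance meets the target."""
    cumulative = 0.0
    for k, ratio in enumerate(explained_variance_ratio, start=1):
        cumulative += ratio
        if cumulative >= pca_variance - 1e-12:
            return k
    return len(explained_variance_ratio)


# Squared distances from one cell to three cluster centroids (made-up values).
distances = [0.10, 0.20, 0.60]
sharp = soft_assignments(distances, sigma=0.05)
soft = soft_assignments(distances, sigma=0.3)
print("sigma=0.05:", [round(p, 3) for p in sharp])
print("sigma=0.30:", [round(p, 3) for p in soft])

# Made-up explained-variance ratios, sorted in decreasing order.
ratios = [0.50, 0.25, 0.10, 0.08, 0.04, 0.03]
print("components at 0.9:", n_components_for_variance(ratios, 0.9))
print("components at 0.8:", n_components_for_variance(ratios, 0.8))
```

With the smaller sigma, almost all of the membership goes to the nearest cluster. With the larger sigma, it is spread across neighboring clusters.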
ComBat Parameters
covariates (optional): List of covariate column names
- Covariates are biological variables to preserve (e.g., condition, treatment)
- ComBat will correct for batch while preserving covariate effects
- Example: ["condition", "treatment"]
Using Batch Correction in the GUI
1. Ensure feature extraction has been completed
2. Navigate to Analysis → Batch Correction or click the batch correction button
3. In the batch correction dialog:
   - Select the batch correction method (Harmony or ComBat)
   - Choose or verify the batch variable (auto-detected if available)
   - Select features to correct (auto-detected if not specified)
   - Adjust method-specific parameters
   - For ComBat, optionally specify covariates to preserve
4. Choose an output location for the corrected features CSV file
5. Click Apply Batch Correction to start the process
6. The corrected features are saved and can be used for downstream analysis
Using Batch Correction in the CLI
Basic Harmony Command
openimc batch-correction features.csv corrected_features.csv \
  --method harmony \
  --batch-var source_file
With Custom Parameters
openimc batch-correction features.csv corrected_features.csv \
  --method harmony \
  --batch-var source_file \
  --n-clusters 40 \
  --sigma 0.15 \
  --theta 2.5 \
  --lambda-reg 1.2
ComBat Command
openimc batch-correction features.csv corrected_features.csv \
  --method combat \
  --batch-var source_file \
  --covariates condition,treatment
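When running several parameter combinations, it can help to assemble these commands from a script. The sketch below builds the argument list in Python using only the flags shown in the commands above. The helper name is hypothetical, and the underscore-to-hyphen mapping is an assumption for any parameter whose flag is not shown above (for example max_iter).

```python
import shlex


def build_batch_correction_command(input_csv, output_csv, method="harmony",
                                   batch_var=None, covariates=None, **harmony_opts):
    """Assemble an `openimc batch-correction` argument list from keyword options."""
    cmd = ["openimc", "batch-correction", input_csv, output_csv, "--method", method]
    if batch_var is not None:
        cmd += ["--batch-var", batch_var]
    if method == "combat" and covariates:
        cmd += ["--covariates", ",".join(covariates)]
    if method == "harmony":
        # n_clusters -> --n-clusters, lambda_reg -> --lambda-reg, and so on.
        for name, value in harmony_opts.items():
            cmd += ["--" + name.replace("_", "-"), str(value)]
    return cmd


cmd = build_batch_correction_command(
    "features.csv", "corrected_features.csv",
    batch_var="source_file", n_clusters=40, sigma=0.15, theta=2.5, lambda_reg=1.2,
)
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Passing the list directly to subprocess.run avoids shell quoting problems with file names that contain spaces.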
Workflow YAML Example
batch_correction:
  enabled: true
  method: "harmony"
  batch_variable: "source_file"
  features: null  # Auto-detect
  n_clusters: 30
  sigma: 0.1
  theta: 2.0
  lambda_reg: 1.0
  max_iter: 20
  pca_variance: 0.9
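A workflow block like this can be sanity-checked before a long run. The sketch below merges user settings over the documented defaults and rejects obviously invalid values. It assumes the YAML has already been parsed into a Python dict (YAML null becomes None). The function name, the default of enabled: true, and the specific checks are illustrative assumptions, not OpenIMC's own validation.

```python
# Defaults as listed in the parameter documentation.
HARMONY_DEFAULTS = {
    "n_clusters": 30, "sigma": 0.1, "theta": 2.0,
    "lambda_reg": 1.0, "max_iter": 20, "pca_variance": 0.9,
}


def resolve_batch_correction_config(user_config):
    """Merge user settings over the documented defaults and sanity-check them."""
    config = {"enabled": True, "method": "harmony",
              "batch_variable": None, "features": None, **HARMONY_DEFAULTS}
    unknown = set(user_config) - set(config) - {"covariates"}
    if unknown:
        raise KeyError(f"unknown batch_correction keys: {sorted(unknown)}")
    config.update(user_config)
    if config["method"] not in ("harmony", "combat"):
        raise ValueError("method must be 'harmony' or 'combat'")
    if not 0.0 < config["pca_variance"] <= 1.0:
        raise ValueError("pca_variance must be in (0, 1]")
    for key in ("n_clusters", "max_iter"):
        if int(config[key]) < 1:
            raise ValueError(f"{key} must be a positive integer")
    return config


config = resolve_batch_correction_config({"n_clusters": 40, "features": None})
print(config["method"], config["n_clusters"], config["sigma"])
```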
Method Details
Harmony
Harmony is an iterative clustering-based method that corrects batch effects in low-dimensional space (typically PCA).
How it works:
1. PCA Projection: Features are projected into PCA space, retaining a specified proportion of variance (pca_variance)
2. Soft Clustering: Cells are assigned to clusters using soft k-means with specified width (sigma)
3. Batch Correction: Within each cluster, batch-specific effects are estimated and removed
4. Iteration: Steps 2-3 are repeated until convergence (up to max_iter iterations)
5. Diversity Penalty: The theta parameter encourages clusters to contain cells from multiple batches
6. Regularization: The lambda_reg parameter controls the strength of correction
7. Back-projection: Corrected PCA coordinates are transformed back to original feature space
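The within-cluster correction (step 3) is the core of the method. The sketch below shows one simplified pass in pure Python. Note the substitution: it uses membership-weighted mean-centering per batch, not Harmony's regularized regression, so lambda_reg and theta play no role here. The function name and the toy data are invented for illustration.

```python
def correct_within_clusters(Z, R, batches):
    """One simplified correction pass (weighted mean-centering, not full Harmony).

    Z: list of cells, each a list of PCA coordinates.
    R: list of cells, each a list of soft cluster memberships (rows sum to 1).
    batches: batch label per cell.
    """
    n_dims, n_clusters = len(Z[0]), len(R[0])

    def weighted_mean(cell_indices, k):
        total = sum(R[i][k] for i in cell_indices)
        if total == 0.0:
            return None
        return [sum(R[i][k] * Z[i][d] for i in cell_indices) / total
                for d in range(n_dims)]

    all_cells = range(len(Z))
    by_batch = {}
    for i, b in enumerate(batches):
        by_batch.setdefault(b, []).append(i)

    # offsets[(k, b)]: how far batch b sits from the cluster-k centroid.
    offsets = {}
    for k in range(n_clusters):
        centroid = weighted_mean(all_cells, k)
        for b, members in by_batch.items():
            batch_mean = weighted_mean(members, k)
            if centroid is None or batch_mean is None:
                offsets[(k, b)] = [0.0] * n_dims
            else:
                offsets[(k, b)] = [m - c for m, c in zip(batch_mean, centroid)]

    # Subtract each cell's membership-weighted batch offset.
    corrected = []
    for i, z in enumerate(Z):
        shift = [sum(R[i][k] * offsets[(k, batches[i])][d] for k in range(n_clusters))
                 for d in range(n_dims)]
        corrected.append([zd - sd for zd, sd in zip(z, shift)])
    return corrected


# Toy data: two cell populations near 0 and 10; batch "B" is shifted by +1.
Z = [[0.0], [0.2], [10.0], [10.2], [1.0], [1.2], [11.0], [11.2]]
batches = ["A"] * 4 + ["B"] * 4
R = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]] * 2  # hard memberships for clarity
corrected = correct_within_clusters(Z, R, batches)
print([round(z[0], 2) for z in corrected])
```

In the toy data the batch shift disappears within each cluster, while the distance between the two populations is unchanged. That is the behavior the correction step aims for. In the full algorithm this pass alternates with re-clustering until convergence.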
Advantages:
- Preserves biological variation while removing batch effects
- Works well with many batches
- Handles non-linear batch effects
- Fast and scalable
Limitations:
- Requires sufficient cells per batch for stable correction
- May over-correct if batch effects are weak
- Parameters may need tuning for optimal results
Citation:
- Korsunsky, I., et al. (2019). “Fast, sensitive and accurate integration of single-cell data with Harmony.” Nature Methods, 16(12), 1289-1296. DOI: 10.1038/s41592-019-0619-0
- Harmony GitHub
- harmonypy Python Package
ComBat
ComBat is an empirical Bayes method that estimates and removes batch effects while preserving biological variation.
How it works:
1. Model Fitting: For each feature, ComBat fits a linear model:
   - Feature = Batch Effect + Biological Variation + Error
2. Empirical Bayes: Batch effects are estimated using empirical Bayes, which borrows information across features
3. Covariate Adjustment: If covariates are specified, their effects are preserved while correcting for batch
4. Correction: Estimated batch effects are subtracted from the data
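The location and scale idea behind these steps can be illustrated for a single feature. The sketch below aligns each batch's mean and spread to pooled values. It is deliberately simplified: there is no empirical Bayes shrinkage across features and no covariate adjustment, so it is not ComBat itself. The function name and data are invented for the example.

```python
from statistics import mean, pstdev


def location_scale_adjust(values, batches):
    """Align each batch's mean and spread to pooled values for one feature.

    Simplified stand-in for ComBat: no empirical Bayes shrinkage, no covariates.
    """
    groups = {}
    for v, b in zip(values, batches):
        groups.setdefault(b, []).append(v)
    batch_mean = {b: mean(vals) for b, vals in groups.items()}
    batch_sd = {b: pstdev(vals) for b, vals in groups.items()}

    grand_mean = mean(values)
    # Pooled within-batch standard deviation.
    within_ss = sum((v - batch_mean[b]) ** 2 for v, b in zip(values, batches))
    pooled_sd = (within_ss / len(values)) ** 0.5

    adjusted = []
    for v, b in zip(values, batches):
        scale = pooled_sd / batch_sd[b] if batch_sd[b] > 0 else 1.0
        adjusted.append(grand_mean + (v - batch_mean[b]) * scale)
    return adjusted


# One feature measured in two batches with different offset and spread.
values = [1.0, 2.0, 3.0, 11.0, 14.0, 17.0]
batches = ["A", "A", "A", "B", "B", "B"]
adjusted = location_scale_adjust(values, batches)
print([round(v, 3) for v in adjusted])
```

This version removes every difference between batch means, including real biological ones. If batches differ in biology (for example, one batch contains mostly treated samples), that difference would be erased, which is the situation covariates are meant to handle.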
Advantages:
- Well-established method with proven track record
- Can preserve specified covariates
- Works well with small batch sizes
- Fast computation
Limitations:
- Assumes linear batch effects
- May not handle non-linear batch effects well
- Requires sufficient features for stable estimation
Citation:
- Johnson, W. E., et al. (2007). “Adjusting batch effects in microarray expression data using empirical Bayes methods.” Biostatistics, 8(1), 118-127. DOI: 10.1093/biostatistics/kxj037
- Zhang, Y., et al. (2020). “ComBat-seq: batch effect adjustment for RNA-seq count data.” NAR Genomics and Bioinformatics, 2(3), lqaa078. DOI: 10.1093/nargab/lqaa078
- pycombat Python Package
Tips and Best Practices
When to Use Batch Correction: Apply batch correction when combining data from:
- Multiple acquisition runs
- Different experimental days
- Different instruments or protocols
- Multiple files with systematic differences
Method Selection:
- Use Harmony for most cases, especially with many batches or non-linear effects
- Use ComBat when you need to preserve specific covariates or have linear batch effects
Parameter Tuning:
- Start with default parameters
- Visualize results (e.g., PCA plots) to assess correction quality
- Adjust parameters if batches are not mixing well or if over-correction occurs
Validation: After batch correction, verify that:
- Batch effects are reduced (check PCA plots colored by batch)
- Biological variation is preserved (check that known cell types still separate)
- Feature distributions are reasonable (no extreme values or artifacts)
Feature Selection: Only correct features that are affected by batch effects. Exclude:
- Morphological features (typically less affected by batch)
- Features that should vary by batch (e.g., acquisition-specific markers)
Covariates: When using ComBat, carefully select covariates to preserve. Include only true biological variables, not technical variables that correlate with batch.
Downstream Analysis: Use batch-corrected features for:
- Clustering across batches
- Differential expression analysis
- Spatial analysis combining multiple files
- Any analysis requiring integrated data
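The validation checks above are usually done visually, but a simple number can complement the plots. The sketch below computes, for each cell, the fraction of its k nearest neighbors that come from a different batch, then averages over all cells. The metric, function name, and toy coordinates are illustrative assumptions and not an OpenIMC feature. The brute-force neighbor search is only suitable for small subsamples.

```python
import math


def neighbor_batch_mixing(points, batches, k=3):
    """Mean fraction of each cell's k nearest neighbors that come from another batch."""
    fractions = []
    for i, p in enumerate(points):
        dists = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        neighbors = [j for _, j in dists[:k]]
        other = sum(batches[j] != batches[i] for j in neighbors)
        fractions.append(other / len(neighbors))
    return sum(fractions) / len(fractions)


batches = ["A", "B"] * 4
# Before correction: the batches sit far apart along the first axis.
before = [(0.0,), (5.0,), (0.1,), (5.1,), (0.2,), (5.2,), (0.3,), (5.3,)]
# After correction: cells from the two batches are interleaved.
after = [(0.0,), (0.1,), (0.2,), (0.3,), (0.4,), (0.5,), (0.6,), (0.7,)]

mix_before = neighbor_batch_mixing(before, batches, k=2)
mix_after = neighbor_batch_mixing(after, batches, k=2)
print(mix_before, mix_after)
```

Higher values indicate better mixing across batches. The same score computed with cell type labels in place of batch labels should stay low after correction, which is a quick check that known cell types still separate.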