API Reference¶
PyROMA: Python implementation of Representation Of Module Activity for single-cell RNA-seq data.
- class pyroma.GeneSetResult(subset: AnnData, subsetlist: ndarray, outliers: List[int], nullgenesetsize: int, svd: TruncatedSVD | PCA | IncrementalPCA)[source]¶
Bases:
objectContainer for gene set analysis results.
- subset¶
Subset of data for this gene set
- Type:
sc.AnnData
- subsetlist¶
Array of gene names in the gene set
- Type:
np.ndarray
- svd¶
Fitted decomposition object
- Type:
Union[TruncatedSVD, PCA, IncrementalPCA]
- X¶
Processed data matrix
- Type:
np.ndarray
- raw_X_subset¶
Raw data subset for PC sign correction
- Type:
np.ndarray
- projections_1¶
Projections onto first principal component
- Type:
np.ndarray
- projections_2¶
Projections onto second principal component
- Type:
np.ndarray
- nulll1¶
Null distribution of L1 values
- Type:
np.ndarray
- null_median_exp¶
Null distribution of median expression
- Type:
np.ndarray
- null_projections¶
Null projections for both components
- Type:
np.ndarray
- class pyroma.ROMA[source]¶
Bases:
objectROMA (Representation Of Module Activity) implementation for single-cell RNA-seq data.
This class implements the ROMA algorithm for quantifying gene set activity in single-cell transcriptomics data using principal component analysis.
- adata¶
Annotated data matrix
- Type:
Optional[sc.AnnData]
- results¶
Results for each analyzed gene set
- Type:
Dict[str, GeneSetResult]
- null_distributions¶
Cached null distributions by gene set size
- Type:
Dict[int, Tuple[np.ndarray, np.ndarray]]
- approx_size(flag: bool) None[source]¶
Determine appropriate null gene set size with approximation.
- Parameters:
flag (bool) – Whether this is the first gene set
- assess_significance(results: Dict[str, GeneSetResult]) Dict[str, GeneSetResult][source]¶
Compute adjusted p-values using Benjamini-Hochberg procedure.
- Parameters:
results (Dict[str, GeneSetResult]) – Results for each gene set
- Returns:
results – Results with adjusted p-values
- Return type:
Dict[str, GeneSetResult]
- compare_conditions(condition_key: str, module_name: str, test: str = 'wilcoxon') Dict[str, Any][source]¶
Compare module activity between conditions.
- Parameters:
- Returns:
results – Test results including p-value and effect size
- Return type:
Dict[str, Any]
- compute(selected_gene_sets: List[str] | str = 'all', parallel: bool = True, incremental: bool = False, iters: int = 100, partial_fit: bool = False, algorithm: str = 'randomized', loocv_on: bool = True, double_mean_centering: bool = False, outlier_fisher_thr: float = 0.05, min_gene_sets: int = 3, max_outlier_prop: float = 0.5, verbose: int = 0) None[source]¶
Compute ROMA module activity scores for selected gene sets.
- Parameters:
selected_gene_sets (Union[List[str], str], default='all') – List of gene set names to analyze, or ‘all’ for all gene sets
parallel (bool, default=False) – Whether to use parallel processing for null distribution computation
incremental (bool, default=False) – Whether to use incremental PCA instead of truncated SVD
iters (int, default=100) – Number of iterations for null distribution generation
partial_fit (bool, default=False) – Whether to use partial_fit for incremental PCA
algorithm (str, default='randomized') – Algorithm for SVD computation (‘randomized’ or ‘arpack’)
loocv_on (bool, default=True) – Whether to perform leave-one-out cross-validation for outlier detection
double_mean_centering (bool, default=False) – Whether to center data across both samples and genes
outlier_fisher_thr (float, default=0.05) – P-value threshold for Fisher’s exact test in outlier detection
min_gene_sets (int, default=3) – Minimum number of gene sets a gene must appear in to be considered for outlier removal
max_outlier_prop (float, default=0.5) – Maximum proportion of genes that can be removed as outliers per gene set
verbose (int, default=0) – Verbosity level
- Raises:
ValueError – If no gene sets are loaded or if data is not properly initialized
- compute_batch(selected_gene_sets: List[str], batch_size: int = 50, **kwargs) None[source]¶
Process gene sets in batches for memory efficiency.
- compute_median_exp(svd_: TruncatedSVD | PCA | IncrementalPCA, X: ndarray, raw_X_subset: ndarray, gene_set_name: str | None = None) Tuple[float, ndarray, ndarray][source]¶
Compute median expression for shifted pathway detection.
- Parameters:
svd (Union[TruncatedSVD, PCA, IncrementalPCA]) – Fitted decomposition object
X (np.ndarray) – Processed data matrix
raw_X_subset (np.ndarray) – Raw data subset
gene_set_name (Optional[str], default=None) – Name of current gene set
- Returns:
median_exp (float) – Median of gene projections on PC1
projections_1 (np.ndarray) – Gene projections on PC1
projections_2 (np.ndarray) – Gene projections on PC2
- double_mean_center_matrix(matrix: ndarray) ndarray[source]¶
Double-center a matrix by removing row and column means.
- Parameters:
matrix (np.ndarray) – Input matrix to center
- Returns:
centered_matrix – Double-centered matrix
- Return type:
np.ndarray
- export_results(output_file: str, format: str = 'csv', include_projections: bool = False) None[source]¶
Export ROMA results to various formats.
- fisher_outlier_filter(gene_outlier_counts: Dict[str, int], gene_pathway_counts: Dict[str, int], outlier_fisher_thr: float = 0.05, min_gene_sets: int = 3) Dict[str, bool][source]¶
Determine outlier genes using Fisher’s exact test.
Tests whether each gene’s proportion of outlier calls is significantly higher than the global average.
- Parameters:
gene_outlier_counts (Dict[str, int]) – Number of gene sets where each gene was flagged as outlier
gene_pathway_counts (Dict[str, int]) – Total number of gene sets containing each gene
outlier_fisher_thr (float, default=0.05) – P-value threshold for Fisher’s test
min_gene_sets (int, default=3) – Minimum gene sets required for outlier consideration
- Returns:
gene_outlier_flags – Whether each gene is considered an outlier
- Return type:
- fix_pc_sign(GeneScore: ndarray, SampleScore: ndarray, Wei: ndarray | None = None, Mode: str = 'PreferActivation', DefWei: float = 1.0, Thr: float | None = None, Grouping: Dict[str, str] | None = None, ExpMat: ndarray | None = None, CorMethod: str = 'pearson', gene_set_name: str | None = None) int[source]¶
Determine the correct orientation of principal component.
- Parameters:
GeneScore (np.ndarray) – Gene loadings on PC
SampleScore (np.ndarray) – Sample projections on PC
Wei (Optional[np.ndarray], default=None) – Gene weights
Mode (str, default='PreferActivation') – Method for PC orientation
DefWei (float, default=1.0) – Default weight for missing values
Thr (Optional[float], default=None) – Threshold for various methods
Grouping (Optional[Dict[str, str]], default=None) – Sample grouping information
ExpMat (Optional[np.ndarray], default=None) – Expression matrix
CorMethod (str, default='pearson') – Correlation method
gene_set_name (Optional[str], default=None) – Name of current gene set
- Returns:
sign – +1 or -1 indicating PC orientation
- Return type:
- get_module_scores(module_name: str) ndarray | None[source]¶
Get activity scores for a specific module across all samples.
- Parameters:
module_name (str) – Name of the gene set/module
- Returns:
scores – Activity scores across samples, or None if module not found
- Return type:
Optional[np.ndarray]
- get_raw_p_values(gene_set_name: str | None = None) None[source]¶
Calculate raw p-values from null distributions.
- Parameters:
gene_set_name (Optional[str], default=None) – Name of current gene set
- get_top_genes(module_name: str, n_genes: int = 10, absolute: bool = True) DataFrame | None[source]¶
Get top contributing genes for a module.
- indexing(adata: AnnData) None[source]¶
Create index of unique gene names from AnnData object.
- Parameters:
adata (sc.AnnData) – Annotated data matrix
- limit_outliers_per_geneset(gene_set: List[str], gene_flags: Dict[str, bool], loocv_scores: Dict[str, float], max_outlier_prop: float = 0.5) Dict[str, bool][source]¶
Limit the proportion of genes flagged as outliers in a gene set.
- Parameters:
- Returns:
gene_flags – Updated outlier flags
- Return type:
- load_active_modules_results(path: str) None[source]¶
Load previously saved ROMA results.
- Parameters:
path (str) – Path to saved results directory
- Raises:
FileNotFoundError – If results directory doesn’t exist
- loocv(subset: AnnData, verbose: int = 0, for_randomset: bool = False) List[int][source]¶
Perform leave-one-out cross-validation for outlier detection.
Uses vectorized operations to efficiently compute L1 scores with each gene removed.
- orient_pc1(pc1: ndarray, X: ndarray, raw_X_subset: ndarray, gene_set_name: str | None = None) int[source]¶
Orient first principal component according to configured method.
- p_values_in_frame(assessed_results: Dict[str, GeneSetResult]) DataFrame[source]¶
Create DataFrame with all p-values and statistics.
- Parameters:
assessed_results (Dict[str, GeneSetResult]) – Results with computed p-values
- Returns:
df – DataFrame with all statistics
- Return type:
pd.DataFrame
- process_iteration(sequence: ndarray, idx: ndarray, iteration: int, incremental: bool, partial_fit: bool, algorithm: str) Tuple[float, float, ndarray, ndarray][source]¶
Process single iteration for null distribution computation.
- Parameters:
- Returns:
l1 (float) – L1 value for this iteration
median_exp (float) – Median expression for this iteration
null_projections_1 (np.ndarray) – Projections on PC1
null_projections_2 (np.ndarray) – Projections on PC2
- randomset_parallel(subsetlist: ndarray, outliers: List[int], verbose: int = 0, prefer_type: str = 'processes', incremental: bool = False, iters: int = 100, partial_fit: bool = False, algorithm: str = 'randomized') None[source]¶
Compute null distributions using parallel processing.
- Parameters:
subsetlist (np.ndarray) – Array of gene names in current set
outliers (List[int]) – Indices of outlier genes
verbose (int, default=0) – Verbosity level
prefer_type (str, default='processes') – Joblib backend preference
incremental (bool, default=False) – Whether to use incremental PCA
iters (int, default=100) – Number of iterations
partial_fit (bool, default=False) – Whether to use partial fit
algorithm (str, default='randomized') – SVD algorithm
- read_gmt_to_dict(gmt_path: str) Dict[str, ndarray][source]¶
Read GMT file and convert to dictionary of gene sets.
- Parameters:
gmt_path (str) – Path to GMT format file
- Returns:
genesets – Dictionary mapping gene set names to gene arrays
- Return type:
Dict[str, np.ndarray]
- Raises:
FileNotFoundError – If GMT file doesn’t exist
ValueError – If GMT file format is invalid
- robustIncrementalPCA(adata: AnnData, subsetlist: ndarray, outliers: List[int], for_randomset: bool = False, partial_fit: bool = False, batch_size: int = 1000) Tuple[IncrementalPCA, ndarray][source]¶
Perform robust incremental PCA excluding outlier genes.
- Parameters:
adata (sc.AnnData) – Annotated data matrix
subsetlist (np.ndarray) – Gene names to include
outliers (List[int]) – Indices of outlier genes to exclude
for_randomset (bool, default=False) – Whether this is for random set computation
partial_fit (bool, default=False) – Whether to use partial_fit
batch_size (int, default=1000) – Batch size for incremental PCA
- Returns:
ipca (IncrementalPCA) – Fitted IncrementalPCA object
X (np.ndarray) – Processed data matrix
- robustPCA(adata: AnnData, subsetlist: ndarray, outliers: List[int], for_randomset: bool = False, algorithm: str = 'auto') Tuple[PCA, ndarray][source]¶
Perform robust PCA excluding outlier genes.
- Parameters:
- Returns:
pca (PCA) – Fitted PCA object
X (np.ndarray) – Processed data matrix
- robustTruncatedSVD(adata: AnnData, subsetlist: ndarray, outliers: List[int], for_randomset: bool = False, algorithm: str = 'randomized') Tuple[TruncatedSVD, ndarray][source]¶
- save_active_modules_results(path: str, only_active: bool = True, save_adata: bool = True) None[source]¶
Save ROMA results to disk.
- select_active_modules(q_L1_threshold: float = 0.05, q_Med_Exp_threshold: float = 0.05) None[source]¶
Select significantly active modules based on q-value thresholds.
- select_and_sort_gene_sets(selected_geneset_names: List[str]) List[str][source]¶
Sort gene sets by size for efficient processing.
- subsetting(adata: AnnData, geneset: ndarray, verbose: int = 0) Tuple[AnnData, ndarray][source]¶
Create subset of data for specific gene set.
- Parameters:
adata (sc.AnnData) – Full annotated data matrix
geneset (np.ndarray) – Array of gene names to subset
verbose (int, default=0) – Verbosity level
- Returns:
subset (sc.AnnData) – Subset of data containing only genes in gene set
subsetlist (np.ndarray) – Array of gene names that were found in the data
- class pyroma.color[source]¶
Bases:
object- BOLD = '\x1b[1m'¶
- DARKCYAN = '\x1b[36m'¶
- END = '\x1b[0m'¶
- GREEN = '\x1b[92m'¶
- PURPLE = '\x1b[95m'¶
- pyroma.pbmc3k() AnnData[source]¶
10X peripheral blood mononuclear cells (PBMCs).
Loads single-cell RNA-seq data of peripheral blood mononuclear cells (PBMCs) from a healthy donor.
- Returns:
adata – Annotated data matrix.
- Return type:
AnnData
Example
>>> import pyroma >>> adata = pyroma.datasets.pbmc3k() >>> print(adata)
- pyroma.pbmc_ifnb() AnnData[source]¶
10X peripheral blood mononuclear cells (PBMCs) + IFNb stimulation.
Data from Kang et al. 2018.
Loads single-cell RNA-seq data of peripheral blood mononuclear cells (PBMCs) from healthy and stimulated samples.
- Returns:
adata – Annotated data matrix.
- Return type:
AnnData
Example
>>> import pyroma >>> adata = pyroma.datasets.pbmc_ifnb() >>> print(adata)
- pyroma.use_hallmarks() str[source]¶
Return the absolute path to the bundled Hallmark gene sets. Safe for both editable installs and wheels.