API Reference¶

PyROMA: Python implementation of Representation Of Module Activity for single-cell RNA-seq data.

class pyroma.GeneSetResult(subset: AnnData, subsetlist: ndarray, outliers: List[int], nullgenesetsize: int, svd: TruncatedSVD | PCA | IncrementalPCA)[source]¶

Bases: object

Container for gene set analysis results.

subset¶

Subset of data for this gene set

Type:: sc.AnnData

subsetlist¶

Array of gene names in the gene set

Type:: np.ndarray

outliers¶

Indices of outlier genes

Type:: List[int]

nullgenesetsize¶

Size of null gene set for comparison

Type:: int

svd¶

Fitted decomposition object

Type:: Union[TruncatedSVD, PCA, IncrementalPCA]

X¶

Processed data matrix

Type:: np.ndarray

raw_X_subset¶

Raw data subset for PC sign correction

Type:: np.ndarray

projections_1¶

Projections onto first principal component

Type:: np.ndarray

projections_2¶

Projections onto second principal component

Type:: np.ndarray

nulll1¶

Null distribution of L1 values

Type:: np.ndarray

null_median_exp¶

Null distribution of median expression

Type:: np.ndarray

null_projections¶

Null projections for both components

Type:: np.ndarray

q_value¶

Adjusted p-value for L1 statistic

Type:: float

med_exp_q_value¶

Adjusted p-value for median expression

Type:: float

non_adj_p¶

Raw p-value for L1 statistic

Type:: float

non_adj_med_exp_p¶

Raw p-value for median expression

Type:: float

test_l1¶

L1 test statistic

Type:: float

test_median_exp¶

Median expression test statistic

Type:: float

class pyroma.ROMA[source]¶

Bases: object

ROMA (Representation Of Module Activity) implementation for single-cell RNA-seq data.

This class implements the ROMA algorithm for quantifying gene set activity in single-cell transcriptomics data using principal component analysis.

adata¶

Annotated data matrix

Type:: Optional[sc.AnnData]

gmt¶

Path to GMT file containing gene sets

Type:: Optional[str]

genesets¶

Dictionary mapping gene set names to gene arrays

Type:: Dict[str, np.ndarray]

results¶

Results for each analyzed gene set

Type:: Dict[str, GeneSetResult]

null_distributions¶

Cached null distributions by gene set size

Type:: Dict[int, Tuple[np.ndarray, np.ndarray]]

approx_int¶

Granularity of null gene set size approximation

Type:: int

min_n_genes¶

Minimum number of genes required in a gene set

Type:: int

q_L1_threshold¶

Q-value threshold for L1 significance

Type:: float

q_Med_Exp_threshold¶

Q-value threshold for median expression significance

Type:: float

approx_size(flag: bool) → None[source]¶

Determine appropriate null gene set size with approximation.

Parameters:: flag (bool) – Whether this is the first gene set

assess_significance(results: Dict[str, GeneSetResult]) → Dict[str, GeneSetResult][source]¶

Compute adjusted p-values using Benjamini-Hochberg procedure.

Parameters:: results (Dict[str, GeneSetResult]) – Results for each gene set
Returns:: results – Results with adjusted p-values
Return type:: Dict[str, GeneSetResult]

compare_conditions(condition_key: str, module_name: str, test: str = 'wilcoxon') → Dict[str, Any][source]¶

Compare module activity between conditions.

Parameters:

condition_key (str) – Key in adata.obs containing condition labels
module_name (str) – Name of module to compare
test (str, default='wilcoxon') – Statistical test to use (‘wilcoxon’, ‘t-test’, ‘mann-whitney’)

Returns:

results – Test results including p-value and effect size

Return type:

Dict[str, Any]

compute(selected_gene_sets: List[str] | str = 'all', parallel: bool = True, incremental: bool = False, iters: int = 100, partial_fit: bool = False, algorithm: str = 'randomized', loocv_on: bool = True, double_mean_centering: bool = False, outlier_fisher_thr: float = 0.05, min_gene_sets: int = 3, max_outlier_prop: float = 0.5, verbose: int = 0) → None[source]¶

Compute ROMA module activity scores for selected gene sets.

Parameters:

selected_gene_sets (Union[List[str], str], default='all') – List of gene set names to analyze, or ‘all’ for all gene sets
parallel (bool, default=False) – Whether to use parallel processing for null distribution computation
incremental (bool, default=False) – Whether to use incremental PCA instead of truncated SVD
iters (int, default=100) – Number of iterations for null distribution generation
partial_fit (bool, default=False) – Whether to use partial_fit for incremental PCA
algorithm (str, default='randomized') – Algorithm for SVD computation (‘randomized’ or ‘arpack’)
loocv_on (bool, default=True) – Whether to perform leave-one-out cross-validation for outlier detection
double_mean_centering (bool, default=False) – Whether to center data across both samples and genes
outlier_fisher_thr (float, default=0.05) – P-value threshold for Fisher’s exact test in outlier detection
min_gene_sets (int, default=3) – Minimum number of gene sets a gene must appear in to be considered for outlier removal
max_outlier_prop (float, default=0.5) – Maximum proportion of genes that can be removed as outliers per gene set
verbose (int, default=0) – Verbosity level

Raises:

ValueError – If no gene sets are loaded or if data is not properly initialized

compute_batch(selected_gene_sets: List[str], batch_size: int = 50, **kwargs) → None[source]¶

Process gene sets in batches for memory efficiency.

Parameters:

selected_gene_sets (List[str]) – Gene sets to analyze
batch_size (int, default=50) – Number of gene sets per batch
**kwargs – Additional arguments passed to compute()

compute_median_exp(svd_: TruncatedSVD | PCA | IncrementalPCA, X: ndarray, raw_X_subset: ndarray, gene_set_name: str | None = None) → Tuple[float, ndarray, ndarray][source]¶

Compute median expression for shifted pathway detection.

Parameters:

svd (Union[TruncatedSVD, PCA, IncrementalPCA]) – Fitted decomposition object
X (np.ndarray) – Processed data matrix
raw_X_subset (np.ndarray) – Raw data subset
gene_set_name (Optional[str], default=None) – Name of current gene set

Returns:

median_exp (float) – Median of gene projections on PC1
projections_1 (np.ndarray) – Gene projections on PC1
projections_2 (np.ndarray) – Gene projections on PC2

double_mean_center_matrix(matrix: ndarray) → ndarray[source]¶

Double-center a matrix by removing row and column means.

Parameters:: matrix (np.ndarray) – Input matrix to center
Returns:: centered_matrix – Double-centered matrix
Return type:: np.ndarray

export_results(output_file: str, format: str = 'csv', include_projections: bool = False) → None[source]¶

Export ROMA results to various formats.

Parameters:

output_file (str) – Output file path
format (str, default='csv') – Output format (‘csv’, ‘excel’, ‘h5’)
include_projections (bool, default=False) – Whether to include gene projections

fisher_outlier_filter(gene_outlier_counts: Dict[str, int], gene_pathway_counts: Dict[str, int], outlier_fisher_thr: float = 0.05, min_gene_sets: int = 3) → Dict[str, bool][source]¶

Determine outlier genes using Fisher’s exact test.

Tests whether each gene’s proportion of outlier calls is significantly higher than the global average.

Parameters:

gene_outlier_counts (Dict[str, int]) – Number of gene sets where each gene was flagged as outlier
gene_pathway_counts (Dict[str, int]) – Total number of gene sets containing each gene
outlier_fisher_thr (float, default=0.05) – P-value threshold for Fisher’s test
min_gene_sets (int, default=3) – Minimum gene sets required for outlier consideration

Returns:

gene_outlier_flags – Whether each gene is considered an outlier

Return type:

Dict[str, bool]

fix_pc_sign(GeneScore: ndarray, SampleScore: ndarray, Wei: ndarray | None = None, Mode: str = 'PreferActivation', DefWei: float = 1.0, Thr: float | None = None, Grouping: Dict[str, str] | None = None, ExpMat: ndarray | None = None, CorMethod: str = 'pearson', gene_set_name: str | None = None) → int[source]¶

Determine the correct orientation of principal component.

Parameters:

GeneScore (np.ndarray) – Gene loadings on PC
SampleScore (np.ndarray) – Sample projections on PC
Wei (Optional[np.ndarray], default=None) – Gene weights
Mode (str, default='PreferActivation') – Method for PC orientation
DefWei (float, default=1.0) – Default weight for missing values
Thr (Optional[float], default=None) – Threshold for various methods
Grouping (Optional[Dict[str, str]], default=None) – Sample grouping information
ExpMat (Optional[np.ndarray], default=None) – Expression matrix
CorMethod (str, default='pearson') – Correlation method
gene_set_name (Optional[str], default=None) – Name of current gene set

Returns:

sign – +1 or -1 indicating PC orientation

Return type:

int

get_module_scores(module_name: str) → ndarray | None[source]¶

Get activity scores for a specific module across all samples.

Parameters:: module_name (str) – Name of the gene set/module
Returns:: scores – Activity scores across samples, or None if module not found
Return type:: Optional[np.ndarray]

get_raw_p_values(gene_set_name: str | None = None) → None[source]¶

Calculate raw p-values from null distributions.

Parameters:: gene_set_name (Optional[str], default=None) – Name of current gene set

get_top_genes(module_name: str, n_genes: int = 10, absolute: bool = True) → DataFrame | None[source]¶

Get top contributing genes for a module.

Parameters:

module_name (str) – Name of the gene set/module
n_genes (int, default=10) – Number of top genes to return
absolute (bool, default=True) – Whether to sort by absolute weight values

Returns:

top_genes – DataFrame with gene names and weights

Return type:

Optional[pd.DataFrame]

indexing(adata: AnnData) → None[source]¶

Create index of unique gene names from AnnData object.

Parameters:: adata (sc.AnnData) – Annotated data matrix

limit_outliers_per_geneset(gene_set: List[str], gene_flags: Dict[str, bool], loocv_scores: Dict[str, float], max_outlier_prop: float = 0.5) → Dict[str, bool][source]¶

Limit the proportion of genes flagged as outliers in a gene set.

Parameters:

gene_set (List[str]) – Gene names in the gene set
gene_flags (Dict[str, bool]) – Current outlier flags for genes
loocv_scores (Dict[str, float]) – LOOCV scores for prioritizing outliers
max_outlier_prop (float, default=0.5) – Maximum proportion of outliers allowed

Returns:

gene_flags – Updated outlier flags

Return type:

Dict[str, bool]

load_active_modules_results(path: str) → None[source]¶

Load previously saved ROMA results.

Parameters:: path (str) – Path to saved results directory
Raises:: FileNotFoundError – If results directory doesn’t exist

loocv(subset: AnnData, verbose: int = 0, for_randomset: bool = False) → List[int][source]¶

Perform leave-one-out cross-validation for outlier detection.

Uses vectorized operations to efficiently compute L1 scores with each gene removed.

Parameters:

subset (sc.AnnData) – Subset of data for current gene set
verbose (int, default=0) – Verbosity level
for_randomset (bool, default=False) – Whether this is for random set computation

Returns:

outliers – Indices of detected outlier genes

Return type:

List[int]

orient_pc1(pc1: ndarray, X: ndarray, raw_X_subset: ndarray, gene_set_name: str | None = None) → int[source]¶

Orient first principal component according to configured method.

Parameters:

pc1 (np.ndarray) – First principal component
X (np.ndarray) – Processed data matrix
raw_X_subset (np.ndarray) – Raw data subset for orientation
gene_set_name (Optional[str], default=None) – Name of current gene set

Returns:

correct_sign – +1 or -1 for PC orientation

Return type:

int

p_values_in_frame(assessed_results: Dict[str, GeneSetResult]) → DataFrame[source]¶

Create DataFrame with all p-values and statistics.

Parameters:: assessed_results (Dict[str, GeneSetResult]) – Results with computed p-values
Returns:: df – DataFrame with all statistics
Return type:: pd.DataFrame

process_iteration(sequence: ndarray, idx: ndarray, iteration: int, incremental: bool, partial_fit: bool, algorithm: str) → Tuple[float, float, ndarray, ndarray][source]¶

Process single iteration for null distribution computation.

Parameters:

sequence (np.ndarray) – Sequence of indices to sample from
idx (np.ndarray) – Gene indices
iteration (int) – Current iteration number
incremental (bool) – Whether to use incremental PCA
partial_fit (bool) – Whether to use partial fit
algorithm (str) – SVD algorithm

Returns:

l1 (float) – L1 value for this iteration
median_exp (float) – Median expression for this iteration
null_projections_1 (np.ndarray) – Projections on PC1
null_projections_2 (np.ndarray) – Projections on PC2

randomset_parallel(subsetlist: ndarray, outliers: List[int], verbose: int = 0, prefer_type: str = 'processes', incremental: bool = False, iters: int = 100, partial_fit: bool = False, algorithm: str = 'randomized') → None[source]¶

Compute null distributions using parallel processing.

Parameters:

subsetlist (np.ndarray) – Array of gene names in current set
outliers (List[int]) – Indices of outlier genes
verbose (int, default=0) – Verbosity level
prefer_type (str, default='processes') – Joblib backend preference
incremental (bool, default=False) – Whether to use incremental PCA
iters (int, default=100) – Number of iterations
partial_fit (bool, default=False) – Whether to use partial fit
algorithm (str, default='randomized') – SVD algorithm

read_gmt_to_dict(gmt_path: str) → Dict[str, ndarray][source]¶

Read GMT file and convert to dictionary of gene sets.

Parameters:

gmt_path (str) – Path to GMT format file

Returns:

genesets – Dictionary mapping gene set names to gene arrays

Return type:

Dict[str, np.ndarray]

Raises:

FileNotFoundError – If GMT file doesn’t exist
ValueError – If GMT file format is invalid

robustIncrementalPCA(adata: AnnData, subsetlist: ndarray, outliers: List[int], for_randomset: bool = False, partial_fit: bool = False, batch_size: int = 1000) → Tuple[IncrementalPCA, ndarray][source]¶

Perform robust incremental PCA excluding outlier genes.

Parameters:

adata (sc.AnnData) – Annotated data matrix
subsetlist (np.ndarray) – Gene names to include
outliers (List[int]) – Indices of outlier genes to exclude
for_randomset (bool, default=False) – Whether this is for random set computation
partial_fit (bool, default=False) – Whether to use partial_fit
batch_size (int, default=1000) – Batch size for incremental PCA

Returns:

ipca (IncrementalPCA) – Fitted IncrementalPCA object
X (np.ndarray) – Processed data matrix

robustPCA(adata: AnnData, subsetlist: ndarray, outliers: List[int], for_randomset: bool = False, algorithm: str = 'auto') → Tuple[PCA, ndarray][source]¶

Perform robust PCA excluding outlier genes.

Parameters:

adata (sc.AnnData) – Annotated data matrix
subsetlist (np.ndarray) – Gene names to include
outliers (List[int]) – Indices of outlier genes to exclude
for_randomset (bool, default=False) – Whether this is for random set computation
algorithm (str, default='auto') – PCA algorithm

Returns:

pca (PCA) – Fitted PCA object
X (np.ndarray) – Processed data matrix

robustTruncatedSVD(adata: AnnData, subsetlist: ndarray, outliers: List[int], for_randomset: bool = False, algorithm: str = 'randomized') → Tuple[TruncatedSVD, ndarray][source]¶

save_active_modules_results(path: str, only_active: bool = True, save_adata: bool = True) → None[source]¶

Save ROMA results to disk.

Parameters:

path (str) – Output directory path
only_active (bool, default=True) – Whether to save only active modules
save_adata (bool, default=True) – Whether to save the AnnData object

select_active_modules(q_L1_threshold: float = 0.05, q_Med_Exp_threshold: float = 0.05) → None[source]¶

Select significantly active modules based on q-value thresholds.

Parameters:

q_L1_threshold (float, default=0.05) – Q-value threshold for L1 significance
q_Med_Exp_threshold (float, default=0.05) – Q-value threshold for median expression significance

select_and_sort_gene_sets(selected_geneset_names: List[str]) → List[str][source]¶

Sort gene sets by size for efficient processing.

Parameters:: selected_geneset_names (List[str]) – Names of gene sets to process
Returns:: sorted_names – Gene set names sorted by size
Return type:: List[str]

subsetting(adata: AnnData, geneset: ndarray, verbose: int = 0) → Tuple[AnnData, ndarray][source]¶

Create subset of data for specific gene set.

Parameters:

adata (sc.AnnData) – Full annotated data matrix
geneset (np.ndarray) – Array of gene names to subset
verbose (int, default=0) – Verbosity level

Returns:

subset (sc.AnnData) – Subset of data containing only genes in gene set
subsetlist (np.ndarray) – Array of gene names that were found in the data

summary() → None[source]¶: Print summary of ROMA analysis results.

validate_results() → Dict[str, List[str]][source]¶

Validate ROMA results for potential issues.

Returns:: issues – Dictionary of potential issues found
Return type:: Dict[str, List[str]]

class pyroma.color[source]¶

Bases: object

BOLD = '\x1b[1m'¶

DARKCYAN = '\x1b[36m'¶

END = '\x1b[0m'¶

GREEN = '\x1b[92m'¶

PURPLE = '\x1b[95m'¶

pyroma.integrate_projection_results(global_adata, cell_type, gene_set_results, ct_label)[source]¶

pyroma.pbmc3k() → AnnData[source]¶

10X peripheral blood mononuclear cells (PBMCs).

Loads single-cell RNA-seq data of peripheral blood mononuclear cells (PBMCs) from a healthy donor.

Returns:: adata – Annotated data matrix.
Return type:: AnnData

Example

>>> import pyroma
>>> adata = pyroma.datasets.pbmc3k()
>>> print(adata)

pyroma.pbmc_ifnb() → AnnData[source]¶

10X peripheral blood mononuclear cells (PBMCs) + IFNb stimulation.

Data from Kang et al. 2018.

Loads single-cell RNA-seq data of peripheral blood mononuclear cells (PBMCs) from healthy and stimulated samples.

Returns:: adata – Annotated data matrix.
Return type:: AnnData

Example

>>> import pyroma
>>> adata = pyroma.datasets.pbmc_ifnb()
>>> print(adata)

pyroma.use_hallmarks() → str[source]¶: Return the absolute path to the bundled Hallmark gene sets. Safe for both editable installs and wheels.

pyroma.use_progeny() → str[source]¶: Return the absolute path to the bundled Progeny gene sets. Safe for both editable installs and wheels.

pyroma.use_reactome() → str[source]¶: Return the absolute path to the bundled Reactome gene sets. Safe for both editable installs and wheels.