API Reference

PyROMA: Python implementation of Representation Of Module Activity for single-cell RNA-seq data.

class pyroma.GeneSetResult(subset: AnnData, subsetlist: ndarray, outliers: List[int], nullgenesetsize: int, svd: TruncatedSVD | PCA | IncrementalPCA)[source]

Bases: object

Container for gene set analysis results.

subset

Subset of data for this gene set

Type:

sc.AnnData

subsetlist

Array of gene names in the gene set

Type:

np.ndarray

outliers

Indices of outlier genes

Type:

List[int]

nullgenesetsize

Size of null gene set for comparison

Type:

int

svd

Fitted decomposition object

Type:

Union[TruncatedSVD, PCA, IncrementalPCA]

X

Processed data matrix

Type:

np.ndarray

raw_X_subset

Raw data subset for PC sign correction

Type:

np.ndarray

projections_1

Projections onto first principal component

Type:

np.ndarray

projections_2

Projections onto second principal component

Type:

np.ndarray

nulll1

Null distribution of L1 values

Type:

np.ndarray

null_median_exp

Null distribution of median expression

Type:

np.ndarray

null_projections

Null projections for both components

Type:

np.ndarray

q_value

Adjusted p-value for L1 statistic

Type:

float

med_exp_q_value

Adjusted p-value for median expression

Type:

float

non_adj_p

Raw p-value for L1 statistic

Type:

float

non_adj_med_exp_p

Raw p-value for median expression

Type:

float

test_l1

L1 test statistic

Type:

float

test_median_exp

Median expression test statistic

Type:

float

class pyroma.ROMA[source]

Bases: object

ROMA (Representation Of Module Activity) implementation for single-cell RNA-seq data.

This class implements the ROMA algorithm for quantifying gene set activity in single-cell transcriptomics data using principal component analysis.

adata

Annotated data matrix

Type:

Optional[sc.AnnData]

gmt

Path to GMT file containing gene sets

Type:

Optional[str]

genesets

Dictionary mapping gene set names to gene arrays

Type:

Dict[str, np.ndarray]

results

Results for each analyzed gene set

Type:

Dict[str, GeneSetResult]

null_distributions

Cached null distributions by gene set size

Type:

Dict[int, Tuple[np.ndarray, np.ndarray]]

approx_int

Granularity of null gene set size approximation

Type:

int

min_n_genes

Minimum number of genes required in a gene set

Type:

int

q_L1_threshold

Q-value threshold for L1 significance

Type:

float

q_Med_Exp_threshold

Q-value threshold for median expression significance

Type:

float

approx_size(flag: bool) None[source]

Determine appropriate null gene set size with approximation.

Parameters:

flag (bool) – Whether this is the first gene set

assess_significance(results: Dict[str, GeneSetResult]) Dict[str, GeneSetResult][source]

Compute adjusted p-values using Benjamini-Hochberg procedure.

Parameters:

results (Dict[str, GeneSetResult]) – Results for each gene set

Returns:

results – Results with adjusted p-values

Return type:

Dict[str, GeneSetResult]

compare_conditions(condition_key: str, module_name: str, test: str = 'wilcoxon') Dict[str, Any][source]

Compare module activity between conditions.

Parameters:
  • condition_key (str) – Key in adata.obs containing condition labels

  • module_name (str) – Name of module to compare

  • test (str, default='wilcoxon') – Statistical test to use (‘wilcoxon’, ‘t-test’, ‘mann-whitney’)

Returns:

results – Test results including p-value and effect size

Return type:

Dict[str, Any]

compute(selected_gene_sets: List[str] | str = 'all', parallel: bool = True, incremental: bool = False, iters: int = 100, partial_fit: bool = False, algorithm: str = 'randomized', loocv_on: bool = True, double_mean_centering: bool = False, outlier_fisher_thr: float = 0.05, min_gene_sets: int = 3, max_outlier_prop: float = 0.5, verbose: int = 0) None[source]

Compute ROMA module activity scores for selected gene sets.

Parameters:
  • selected_gene_sets (Union[List[str], str], default='all') – List of gene set names to analyze, or ‘all’ for all gene sets

  • parallel (bool, default=False) – Whether to use parallel processing for null distribution computation

  • incremental (bool, default=False) – Whether to use incremental PCA instead of truncated SVD

  • iters (int, default=100) – Number of iterations for null distribution generation

  • partial_fit (bool, default=False) – Whether to use partial_fit for incremental PCA

  • algorithm (str, default='randomized') – Algorithm for SVD computation (‘randomized’ or ‘arpack’)

  • loocv_on (bool, default=True) – Whether to perform leave-one-out cross-validation for outlier detection

  • double_mean_centering (bool, default=False) – Whether to center data across both samples and genes

  • outlier_fisher_thr (float, default=0.05) – P-value threshold for Fisher’s exact test in outlier detection

  • min_gene_sets (int, default=3) – Minimum number of gene sets a gene must appear in to be considered for outlier removal

  • max_outlier_prop (float, default=0.5) – Maximum proportion of genes that can be removed as outliers per gene set

  • verbose (int, default=0) – Verbosity level

Raises:

ValueError – If no gene sets are loaded or if data is not properly initialized

compute_batch(selected_gene_sets: List[str], batch_size: int = 50, **kwargs) None[source]

Process gene sets in batches for memory efficiency.

Parameters:
  • selected_gene_sets (List[str]) – Gene sets to analyze

  • batch_size (int, default=50) – Number of gene sets per batch

  • **kwargs – Additional arguments passed to compute()

compute_median_exp(svd_: TruncatedSVD | PCA | IncrementalPCA, X: ndarray, raw_X_subset: ndarray, gene_set_name: str | None = None) Tuple[float, ndarray, ndarray][source]

Compute median expression for shifted pathway detection.

Parameters:
  • svd (Union[TruncatedSVD, PCA, IncrementalPCA]) – Fitted decomposition object

  • X (np.ndarray) – Processed data matrix

  • raw_X_subset (np.ndarray) – Raw data subset

  • gene_set_name (Optional[str], default=None) – Name of current gene set

Returns:

  • median_exp (float) – Median of gene projections on PC1

  • projections_1 (np.ndarray) – Gene projections on PC1

  • projections_2 (np.ndarray) – Gene projections on PC2

double_mean_center_matrix(matrix: ndarray) ndarray[source]

Double-center a matrix by removing row and column means.

Parameters:

matrix (np.ndarray) – Input matrix to center

Returns:

centered_matrix – Double-centered matrix

Return type:

np.ndarray

export_results(output_file: str, format: str = 'csv', include_projections: bool = False) None[source]

Export ROMA results to various formats.

Parameters:
  • output_file (str) – Output file path

  • format (str, default='csv') – Output format (‘csv’, ‘excel’, ‘h5’)

  • include_projections (bool, default=False) – Whether to include gene projections

fisher_outlier_filter(gene_outlier_counts: Dict[str, int], gene_pathway_counts: Dict[str, int], outlier_fisher_thr: float = 0.05, min_gene_sets: int = 3) Dict[str, bool][source]

Determine outlier genes using Fisher’s exact test.

Tests whether each gene’s proportion of outlier calls is significantly higher than the global average.

Parameters:
  • gene_outlier_counts (Dict[str, int]) – Number of gene sets where each gene was flagged as outlier

  • gene_pathway_counts (Dict[str, int]) – Total number of gene sets containing each gene

  • outlier_fisher_thr (float, default=0.05) – P-value threshold for Fisher’s test

  • min_gene_sets (int, default=3) – Minimum gene sets required for outlier consideration

Returns:

gene_outlier_flags – Whether each gene is considered an outlier

Return type:

Dict[str, bool]

fix_pc_sign(GeneScore: ndarray, SampleScore: ndarray, Wei: ndarray | None = None, Mode: str = 'PreferActivation', DefWei: float = 1.0, Thr: float | None = None, Grouping: Dict[str, str] | None = None, ExpMat: ndarray | None = None, CorMethod: str = 'pearson', gene_set_name: str | None = None) int[source]

Determine the correct orientation of principal component.

Parameters:
  • GeneScore (np.ndarray) – Gene loadings on PC

  • SampleScore (np.ndarray) – Sample projections on PC

  • Wei (Optional[np.ndarray], default=None) – Gene weights

  • Mode (str, default='PreferActivation') – Method for PC orientation

  • DefWei (float, default=1.0) – Default weight for missing values

  • Thr (Optional[float], default=None) – Threshold for various methods

  • Grouping (Optional[Dict[str, str]], default=None) – Sample grouping information

  • ExpMat (Optional[np.ndarray], default=None) – Expression matrix

  • CorMethod (str, default='pearson') – Correlation method

  • gene_set_name (Optional[str], default=None) – Name of current gene set

Returns:

sign – +1 or -1 indicating PC orientation

Return type:

int

get_module_scores(module_name: str) ndarray | None[source]

Get activity scores for a specific module across all samples.

Parameters:

module_name (str) – Name of the gene set/module

Returns:

scores – Activity scores across samples, or None if module not found

Return type:

Optional[np.ndarray]

get_raw_p_values(gene_set_name: str | None = None) None[source]

Calculate raw p-values from null distributions.

Parameters:

gene_set_name (Optional[str], default=None) – Name of current gene set

get_top_genes(module_name: str, n_genes: int = 10, absolute: bool = True) DataFrame | None[source]

Get top contributing genes for a module.

Parameters:
  • module_name (str) – Name of the gene set/module

  • n_genes (int, default=10) – Number of top genes to return

  • absolute (bool, default=True) – Whether to sort by absolute weight values

Returns:

top_genes – DataFrame with gene names and weights

Return type:

Optional[pd.DataFrame]

indexing(adata: AnnData) None[source]

Create index of unique gene names from AnnData object.

Parameters:

adata (sc.AnnData) – Annotated data matrix

limit_outliers_per_geneset(gene_set: List[str], gene_flags: Dict[str, bool], loocv_scores: Dict[str, float], max_outlier_prop: float = 0.5) Dict[str, bool][source]

Limit the proportion of genes flagged as outliers in a gene set.

Parameters:
  • gene_set (List[str]) – Gene names in the gene set

  • gene_flags (Dict[str, bool]) – Current outlier flags for genes

  • loocv_scores (Dict[str, float]) – LOOCV scores for prioritizing outliers

  • max_outlier_prop (float, default=0.5) – Maximum proportion of outliers allowed

Returns:

gene_flags – Updated outlier flags

Return type:

Dict[str, bool]

load_active_modules_results(path: str) None[source]

Load previously saved ROMA results.

Parameters:

path (str) – Path to saved results directory

Raises:

FileNotFoundError – If results directory doesn’t exist

loocv(subset: AnnData, verbose: int = 0, for_randomset: bool = False) List[int][source]

Perform leave-one-out cross-validation for outlier detection.

Uses vectorized operations to efficiently compute L1 scores with each gene removed.

Parameters:
  • subset (sc.AnnData) – Subset of data for current gene set

  • verbose (int, default=0) – Verbosity level

  • for_randomset (bool, default=False) – Whether this is for random set computation

Returns:

outliers – Indices of detected outlier genes

Return type:

List[int]

orient_pc1(pc1: ndarray, X: ndarray, raw_X_subset: ndarray, gene_set_name: str | None = None) int[source]

Orient first principal component according to configured method.

Parameters:
  • pc1 (np.ndarray) – First principal component

  • X (np.ndarray) – Processed data matrix

  • raw_X_subset (np.ndarray) – Raw data subset for orientation

  • gene_set_name (Optional[str], default=None) – Name of current gene set

Returns:

correct_sign – +1 or -1 for PC orientation

Return type:

int

p_values_in_frame(assessed_results: Dict[str, GeneSetResult]) DataFrame[source]

Create DataFrame with all p-values and statistics.

Parameters:

assessed_results (Dict[str, GeneSetResult]) – Results with computed p-values

Returns:

df – DataFrame with all statistics

Return type:

pd.DataFrame

process_iteration(sequence: ndarray, idx: ndarray, iteration: int, incremental: bool, partial_fit: bool, algorithm: str) Tuple[float, float, ndarray, ndarray][source]

Process single iteration for null distribution computation.

Parameters:
  • sequence (np.ndarray) – Sequence of indices to sample from

  • idx (np.ndarray) – Gene indices

  • iteration (int) – Current iteration number

  • incremental (bool) – Whether to use incremental PCA

  • partial_fit (bool) – Whether to use partial fit

  • algorithm (str) – SVD algorithm

Returns:

  • l1 (float) – L1 value for this iteration

  • median_exp (float) – Median expression for this iteration

  • null_projections_1 (np.ndarray) – Projections on PC1

  • null_projections_2 (np.ndarray) – Projections on PC2

randomset_parallel(subsetlist: ndarray, outliers: List[int], verbose: int = 0, prefer_type: str = 'processes', incremental: bool = False, iters: int = 100, partial_fit: bool = False, algorithm: str = 'randomized') None[source]

Compute null distributions using parallel processing.

Parameters:
  • subsetlist (np.ndarray) – Array of gene names in current set

  • outliers (List[int]) – Indices of outlier genes

  • verbose (int, default=0) – Verbosity level

  • prefer_type (str, default='processes') – Joblib backend preference

  • incremental (bool, default=False) – Whether to use incremental PCA

  • iters (int, default=100) – Number of iterations

  • partial_fit (bool, default=False) – Whether to use partial fit

  • algorithm (str, default='randomized') – SVD algorithm

read_gmt_to_dict(gmt_path: str) Dict[str, ndarray][source]

Read GMT file and convert to dictionary of gene sets.

Parameters:

gmt_path (str) – Path to GMT format file

Returns:

genesets – Dictionary mapping gene set names to gene arrays

Return type:

Dict[str, np.ndarray]

Raises:
robustIncrementalPCA(adata: AnnData, subsetlist: ndarray, outliers: List[int], for_randomset: bool = False, partial_fit: bool = False, batch_size: int = 1000) Tuple[IncrementalPCA, ndarray][source]

Perform robust incremental PCA excluding outlier genes.

Parameters:
  • adata (sc.AnnData) – Annotated data matrix

  • subsetlist (np.ndarray) – Gene names to include

  • outliers (List[int]) – Indices of outlier genes to exclude

  • for_randomset (bool, default=False) – Whether this is for random set computation

  • partial_fit (bool, default=False) – Whether to use partial_fit

  • batch_size (int, default=1000) – Batch size for incremental PCA

Returns:

  • ipca (IncrementalPCA) – Fitted IncrementalPCA object

  • X (np.ndarray) – Processed data matrix

robustPCA(adata: AnnData, subsetlist: ndarray, outliers: List[int], for_randomset: bool = False, algorithm: str = 'auto') Tuple[PCA, ndarray][source]

Perform robust PCA excluding outlier genes.

Parameters:
  • adata (sc.AnnData) – Annotated data matrix

  • subsetlist (np.ndarray) – Gene names to include

  • outliers (List[int]) – Indices of outlier genes to exclude

  • for_randomset (bool, default=False) – Whether this is for random set computation

  • algorithm (str, default='auto') – PCA algorithm

Returns:

  • pca (PCA) – Fitted PCA object

  • X (np.ndarray) – Processed data matrix

robustTruncatedSVD(adata: AnnData, subsetlist: ndarray, outliers: List[int], for_randomset: bool = False, algorithm: str = 'randomized') Tuple[TruncatedSVD, ndarray][source]
save_active_modules_results(path: str, only_active: bool = True, save_adata: bool = True) None[source]

Save ROMA results to disk.

Parameters:
  • path (str) – Output directory path

  • only_active (bool, default=True) – Whether to save only active modules

  • save_adata (bool, default=True) – Whether to save the AnnData object

select_active_modules(q_L1_threshold: float = 0.05, q_Med_Exp_threshold: float = 0.05) None[source]

Select significantly active modules based on q-value thresholds.

Parameters:
  • q_L1_threshold (float, default=0.05) – Q-value threshold for L1 significance

  • q_Med_Exp_threshold (float, default=0.05) – Q-value threshold for median expression significance

select_and_sort_gene_sets(selected_geneset_names: List[str]) List[str][source]

Sort gene sets by size for efficient processing.

Parameters:

selected_geneset_names (List[str]) – Names of gene sets to process

Returns:

sorted_names – Gene set names sorted by size

Return type:

List[str]

subsetting(adata: AnnData, geneset: ndarray, verbose: int = 0) Tuple[AnnData, ndarray][source]

Create subset of data for specific gene set.

Parameters:
  • adata (sc.AnnData) – Full annotated data matrix

  • geneset (np.ndarray) – Array of gene names to subset

  • verbose (int, default=0) – Verbosity level

Returns:

  • subset (sc.AnnData) – Subset of data containing only genes in gene set

  • subsetlist (np.ndarray) – Array of gene names that were found in the data

summary() None[source]

Print summary of ROMA analysis results.

validate_results() Dict[str, List[str]][source]

Validate ROMA results for potential issues.

Returns:

issues – Dictionary of potential issues found

Return type:

Dict[str, List[str]]

class pyroma.color[source]

Bases: object

BOLD = '\x1b[1m'
DARKCYAN = '\x1b[36m'
END = '\x1b[0m'
GREEN = '\x1b[92m'
PURPLE = '\x1b[95m'
pyroma.integrate_projection_results(global_adata, cell_type, gene_set_results, ct_label)[source]
pyroma.pbmc3k() AnnData[source]

10X peripheral blood mononuclear cells (PBMCs).

Loads single-cell RNA-seq data of peripheral blood mononuclear cells (PBMCs) from a healthy donor.

Returns:

adata – Annotated data matrix.

Return type:

AnnData

Example

>>> import pyroma
>>> adata = pyroma.datasets.pbmc3k()
>>> print(adata)
pyroma.pbmc_ifnb() AnnData[source]

10X peripheral blood mononuclear cells (PBMCs) + IFNb stimulation.

Data from Kang et al. 2018.

Loads single-cell RNA-seq data of peripheral blood mononuclear cells (PBMCs) from healthy and stimulated samples.

Returns:

adata – Annotated data matrix.

Return type:

AnnData

Example

>>> import pyroma
>>> adata = pyroma.datasets.pbmc_ifnb()
>>> print(adata)
pyroma.use_hallmarks() str[source]

Return the absolute path to the bundled Hallmark gene sets. Safe for both editable installs and wheels.

pyroma.use_progeny() str[source]

Return the absolute path to the bundled Progeny gene sets. Safe for both editable installs and wheels.

pyroma.use_reactome() str[source]

Return the absolute path to the bundled Reactome gene sets. Safe for both editable installs and wheels.