API Reference

This section provides detailed documentation for all modules and functions in scCellFie.

scCellFie

Communication

sccellfie.communication.compute_local_colocalization_scores(adata, var1, var2, neighbors_radius, method='pairwise_concordance', spatial_key='X_spatial', min_neighbors=3, threshold1=None, threshold2=None, score_key=None, inplace=True)[source]

Computes local colocalization scores between two variables for each spatial spot.

Parameters:
  • adata (AnnData) – AnnData object containing expression data and spatial coordinates.

  • var1 (str) – Name of first variable to analyze.

  • var2 (str) – Name of second variable to analyze.

  • neighbors_radius (float) – Radius defining the neighborhood of a spot: the spot is the center, and all spots within this radius are considered its neighbors.

  • method (str, optional (default: 'pairwise_concordance')) –

    Method to compute colocalization. Options are:
    • ‘correlation’: local Pearson correlation between var1 and var2 across the spot and its neighbors.

    • ‘concordance’: fraction of spots in the neighborhood where both variables are expressed above their thresholds.

    • ‘pairwise_concordance’: fraction of spot pairs in the neighborhood where var1 is expressed above its threshold in the first spot and var2 is expressed above its threshold in the second spot.

    • ‘cosine’: local cosine similarity between var1 and var2 across the spot and its neighbors.

    • ‘weighted_gmean’: local weighted geometric mean across the spot and its neighbors (weighted by distance).

    • ‘regularized_weighted_gmean’: local regularized and weighted geometric mean across the spot and its neighbors (weighted by distance).

  • spatial_key (str, optional (default: 'X_spatial')) – Key in adata.obsm containing spatial coordinates.

  • min_neighbors (int, optional (default: 3)) – Minimum number of neighbors required for computing a score. If fewer neighbors are found, the score is NaN.

  • threshold1 (float, optional (default: None)) – Threshold for var1. If None, the mean of var1 is used.

  • threshold2 (float, optional (default: None)) – Threshold for var2. If None, the mean of var2 is used.

  • score_key (str, optional (default: None)) – Key to store the computed colocalization scores in adata.obs. If None, a default key is used.

  • inplace (bool, optional (default: True)) – If True, the computed scores are added to adata.obs. Otherwise, the scores are returned as a numpy array.

Returns:

Array of colocalization scores for each spot

Return type:

numpy.ndarray
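
As an illustration of the ‘concordance’ option, the sketch below computes, for each spot, the fraction of spots in its neighborhood where both variables exceed their thresholds. It is a plain-Python approximation and not the library's implementation: the helper name local_concordance, the inclusion of the center spot in the neighborhood, and the exclusion of the center from the neighbor count are all assumptions.

```python
import math

def local_concordance(coords, v1, v2, radius, min_neighbors=3, t1=None, t2=None):
    """Hypothetical helper: per-spot fraction of neighborhood spots where
    v1 > t1 and v2 > t2. Thresholds default to the mean of each variable."""
    t1 = sum(v1) / len(v1) if t1 is None else t1
    t2 = sum(v2) / len(v2) if t2 is None else t2
    scores = []
    for i, center in enumerate(coords):
        # Neighbors are all other spots within the radius of the center spot.
        neighbors = [j for j, p in enumerate(coords)
                     if j != i and math.dist(center, p) <= radius]
        if len(neighbors) < min_neighbors:
            scores.append(math.nan)  # too few neighbors: score is undefined
            continue
        members = [i] + neighbors  # assumption: the center spot is included
        hits = sum(1 for j in members if v1[j] > t1 and v2[j] > t2)
        scores.append(hits / len(members))
    return scores

# Four spots on a unit square plus one isolated spot.
coords = [(0, 0), (1, 0), (0, 1), (1, 1), (10, 10)]
v1 = [5.0, 4.0, 0.0, 6.0, 1.0]
v2 = [3.0, 0.0, 0.0, 4.0, 2.0]
scores = local_concordance(coords, v1, v2, radius=1.5, t1=2.0, t2=1.0)
print(scores)
```

In this toy layout the four clustered spots share the same neighborhood, so they receive the same score, while the isolated spot has no neighbors and gets NaN.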

sccellfie.communication.compute_communication_scores(adata, groupby, var_pairs, communication_score='gmean', agg_func='mean', layer=None, ligand_threshold=0, receptor_threshold=0)[source]

Computes communication scores between pairs of features or variables (normally representing ligand-receptor pairs) across different cell types.

Parameters:
  • adata (AnnData) – AnnData object containing expression data and grouping information

  • groupby (str) – Column in adata.obs for grouping cells to aggregate expression.

  • var_pairs (list of tuples) – List of (var1, var2) pairs (normally representing ligand-receptor pairs).

  • communication_score (str, default='gmean') –

    Method to compute communication scores. Options are:
    • ’gmean’: geometric mean (sqrt(x * y))

    • ’product’: simple multiplication (x * y)

    • ’mean’: arithmetic mean ((x + y) / 2)

  • agg_func (str, default='mean') – Aggregation function for aggregating expression values across cells. Options are ‘mean’, ‘median’, ‘25p’ (25th percentile), ‘75p’ (75th percentile), ‘trimean’ (0.5*Q2 + 0.25(Q1+Q3)), and ‘topmean’.

  • layer (str, optional) – Layer in adata to use for aggregation. If None, the main expression matrix adata.X is used.

  • ligand_threshold (float, default=0) – Threshold for calculating the fraction of cells expressing the ligand. Only cells with expression above this threshold are considered as expressing the ligand.

  • receptor_threshold (float, default=0) – Threshold for calculating the fraction of cells expressing the receptor. Only cells with expression above this threshold are considered as expressing the receptor.

Returns:

ccc_scores – DataFrame containing the communication scores between cell types for each variable pair. Columns are:

  • sender_celltype: type of the sender cell

  • receiver_celltype: type of the receiver cell

  • ligand: name of the ligand

  • receptor: name of the receptor

  • score: communication score

  • ligand_fraction: fraction of sender cells expressing the ligand

  • receptor_fraction: fraction of receiver cells expressing the receptor

Return type:

pandas.DataFrame
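
The three communication_score options and the expressing-cell fractions can be sketched in plain Python as follows. The helper names communication_score and fraction_expressing are hypothetical and only mirror the formulas listed above; the aggregation shown is the ‘mean’ option applied to toy values.

```python
import math

def communication_score(ligand_expr, receptor_expr, method="gmean"):
    """Hypothetical helper combining aggregated ligand and receptor values."""
    if method == "gmean":
        return math.sqrt(ligand_expr * receptor_expr)
    if method == "product":
        return ligand_expr * receptor_expr
    if method == "mean":
        return (ligand_expr + receptor_expr) / 2
    raise ValueError(f"Unknown method: {method}")

def fraction_expressing(values, threshold=0):
    """Fraction of cells with expression strictly above the threshold."""
    return sum(v > threshold for v in values) / len(values)

# Toy expression of a ligand in sender cells and a receptor in receiver cells.
ligand_in_sender = [0.0, 8.0, 4.0, 4.0]
receptor_in_receiver = [9.0, 9.0, 0.0, 18.0]

# Aggregate with the arithmetic mean (agg_func='mean' in the text above).
ligand_mean = sum(ligand_in_sender) / len(ligand_in_sender)
receptor_mean = sum(receptor_in_receiver) / len(receptor_in_receiver)

scores = {m: communication_score(ligand_mean, receptor_mean, m)
          for m in ("gmean", "product", "mean")}
ligand_fraction = fraction_expressing(ligand_in_sender)
receptor_fraction = fraction_expressing(receptor_in_receiver)
print(scores, ligand_fraction, receptor_fraction)
```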

Datasets

sccellfie.datasets.retrieve_ensembl2symbol_data(filename=None, organism='human')[source]

Retrieves a dictionary mapping Ensembl IDs to gene symbols for a given organism.

Parameters:
  • filename (str, optional (default: None)) – The file path to a custom CSV file containing Ensembl IDs and gene symbols.

  • organism (str, optional (default: 'human')) – The organism to retrieve data for. Choose ‘human’ or ‘mouse’.

Returns:

ensembl2symbol – A dictionary mapping Ensembl IDs to gene symbols

Return type:

dict

sccellfie.datasets.load_sccellfie_database(organism='human', task_folder=None, rxn_info_filename=None, task_info_filename=None, task_by_rxn_filename=None, task_by_gene_filename=None, rxn_by_gene_filename=None, thresholds_filename=None)[source]

Loads files of the metabolic task database from either a local folder, individual file paths, or predefined URLs.

Parameters:
  • organism (str, optional (default: 'human')) – The organism to retrieve data for. Choose ‘human’ or ‘mouse’. Used when loading from URLs.

  • task_folder (str, optional (default: None)) – The local folder path containing CellFie data files. If provided, this takes priority.

  • rxn_info_filename (str, optional (default: None)) – Full path for reaction information JSON file.

  • task_info_filename (str, optional (default: None)) – Full path for task information CSV file.

  • task_by_rxn_filename (str, optional (default: None)) – Full path for task by reaction CSV file.

  • task_by_gene_filename (str, optional (default: None)) – Full path for task by gene CSV file.

  • rxn_by_gene_filename (str, optional (default: None)) – Full path for reaction by gene CSV file.

  • thresholds_filename (str, optional (default: None)) – Full path for thresholds CSV file.

Returns:

data – A dictionary containing the loaded data frames and information. Keys are ‘rxn_info’, ‘task_info’, ‘task_by_rxn’, ‘task_by_gene’, ‘rxn_by_gene’, ‘thresholds’, and ‘organism’. Examples of dataframes can be found at https://github.com/earmingol/scCellFie/raw/refs/heads/main/task_data/homo_sapiens/

Return type:

dict

Expression

sccellfie.expression.agg_expression_cells(adata, groupby, layer=None, gene_symbols=None, agg_func='mean', top_percent=10, exclude_zeros=False, use_raw=False, threshold=None)[source]

Aggregates gene expression data for specified cell groups in an AnnData object.

Parameters:
  • adata (AnnData) – An AnnData object containing the expression data to be aggregated.

  • groupby (str) – The key in the adata.obs DataFrame to group by. This could be any categorical annotation of cells (e.g., cell type, condition).

  • layer (str, optional (default: None)) – The name of the layer in adata to use for aggregation. If None, the main expression matrix adata.X is used.

  • gene_symbols (str or list, optional (default: None)) – Gene names to include in the aggregation. If a string is provided, it is converted to a single-element list. If None, all genes are included.

  • agg_func (str, optional (default: 'mean')) – The aggregation function to apply. Options are ‘mean’, ‘median’, ‘25p’ (25th percentile), ‘75p’ (75th percentile), ‘trimean’ (0.5*Q2 + 0.25(Q1+Q3)), ‘topmean’ (computed among the top top_percent% of values), and ‘fraction_above’ (fraction of cells above threshold). The function must be one of the keys in the AGG_FUNC dictionary.

  • top_percent (float, optional (default: 10)) – The percentage of top values to consider when agg_func is ‘topmean’. Ranging from 0 to 100.

  • exclude_zeros (bool, optional (default: False)) – Whether to exclude zeros when aggregating the values.

  • use_raw (bool, optional (default: False)) – Whether to use the data in adata.raw.X (True) or in adata.X (False).

  • threshold (float, optional (default: None)) – Expression threshold used when agg_func is ‘fraction_above’. Represents the minimum expression value for a cell to be considered as expressing the gene.

Returns:

agg_expression – A pandas.DataFrame where columns correspond to genes and rows correspond to the unique categories in groupby. Each cell in the DataFrame contains the aggregated expression value for the corresponding gene and group.

Return type:

pandas.DataFrame

Raises:

AssertionError – If the provided agg_func is not a valid key in AGG_FUNC.

Notes

This function is used to compute summary statistics of gene expression data across different groups of cells. It is useful for exploring expression patterns in different cell types or conditions.

The function relies on the groupby parameter in adata.obs to define the groups of cells for which the expression data will be aggregated.
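
To make the group-wise aggregation concrete, here is a small sketch of the ‘trimean’ option applied to one gene across two cell groups. It uses only the standard library; the helper names trimean and aggregate_by_group are hypothetical, and the quartiles are computed with linear interpolation (the "inclusive" method of statistics.quantiles), which is an assumption about how Q1, Q2, and Q3 are obtained.

```python
from collections import defaultdict
from statistics import quantiles

def trimean(values):
    """Tukey's trimean: 0.5*Q2 + 0.25*(Q1 + Q3)."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return 0.5 * q2 + 0.25 * (q1 + q3)

def aggregate_by_group(values, groups, func):
    """Hypothetical helper: apply func to the values of each group."""
    grouped = defaultdict(list)
    for value, group in zip(values, groups):
        grouped[group].append(value)
    return {group: func(vals) for group, vals in grouped.items()}

# Expression of a single gene in ten cells, five per cell type.
expression = [1, 2, 3, 4, 5, 0, 0, 10, 10, 100]
cell_types = ["T cell"] * 5 + ["B cell"] * 5

agg = aggregate_by_group(expression, cell_types, trimean)
print(agg)
```

Note how the trimean of the second group stays close to the bulk of the values despite the outlier of 100, which is the usual motivation for preferring it over the mean.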

sccellfie.expression.top_mean(x, axis, percent=10)[source]

Computes the mean of the top percent% of values along the specified axis of a matrix, handling NaN values.

Parameters:
  • x (numpy.ndarray) – The input matrix containing the data to be aggregated.

  • axis (int) – The axis along which to compute the mean. Use 0 for columns, 1 for rows.

  • percent (float, (default: 10)) – The percentage of top values to consider, ranging from 0 to 100. For example, 10 would compute the mean of the top 10% of values.

Returns:

An array containing the mean of the top percent% of values for each row or column, depending on the specified axis. The shape of the output array will be (n_rows,) if axis=1, or (n_columns,) if axis=0.

Return type:

numpy.ndarray
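
A minimal sketch of the idea, written for rows of a list-of-lists matrix (the axis=1 case). The helper names are hypothetical, and two details are assumptions rather than documented behavior: NaN values are dropped before ranking, and the number of values kept is rounded up so that at least one value is always used.

```python
import math

def top_mean_1d(values, percent=10):
    """Mean of the top `percent`% of the non-NaN values (hypothetical helper)."""
    valid = sorted((v for v in values if not math.isnan(v)), reverse=True)
    if not valid:
        return math.nan
    # Assumption: round up and keep at least one value.
    k = max(1, math.ceil(len(valid) * percent / 100))
    return sum(valid[:k]) / k

def top_mean_rows(matrix, percent=10):
    """Apply top_mean_1d to each row, analogous to axis=1."""
    return [top_mean_1d(row, percent) for row in matrix]

matrix = [
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, math.nan],
    [0.0, 0.0, 0.0, 0.0, 4.0],
]
result = top_mean_rows(matrix, percent=20)
print(result)
```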

sccellfie.expression.fraction_above_threshold(x, axis, threshold=0)[source]

Computes the fraction of values above a threshold along the specified axis.

Parameters:
  • x (numpy.ndarray) – The input matrix containing the data.

  • axis (int) – The axis along which to compute the fraction. Use 0 for columns, 1 for rows.

  • threshold (float, (default: 0)) – The threshold value above which to count values.

Returns:

An array containing the fraction (between 0 and 1) of values above threshold.

Return type:

numpy.ndarray
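
The axis convention can be illustrated with a short plain-Python sketch. The helper fraction_above is hypothetical and works on a list of lists instead of a numpy.ndarray; values equal to the threshold are assumed not to count as above it.

```python
def fraction_above(matrix, axis, threshold=0):
    """Fraction of values strictly above threshold per row (axis=1)
    or per column (axis=0). Hypothetical helper."""
    lines = matrix if axis == 1 else list(zip(*matrix))
    return [sum(v > threshold for v in line) / len(line) for line in lines]

x = [
    [0, 1, 2],
    [3, 0, 0],
]
per_row = fraction_above(x, axis=1)     # one value per row
per_column = fraction_above(x, axis=0)  # one value per column
print(per_row, per_column)
```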

sccellfie.expression.smooth_expression_knn(adata, key_added='smoothed_X', neighbors_key='neighbors', mode='connectivity', alpha=0.33, n_chunks=None, chunk_size=None, use_raw=False, disable_pbar=False)[source]

Smooths expression values based on KNNs of single cells using Scanpy.

Parameters:
  • adata (AnnData object) – Annotated data matrix containing the expression data and nearest neighbor graph.

  • key_added (str, optional (default: 'smoothed_X')) – The key in adata.layers where the smoothed expression matrix will be stored.

  • neighbors_key (str, optional (default: 'neighbors')) – The key in adata.uns where the information about the pre-run KNN analysis was stored. This key points to a dictionary containing the ‘connectivities_key’, ‘distances_key’, and ‘params’ from the analysis.

  • mode (str, optional (default: 'connectivity')) – The mode for calculating the smoothing matrix. Can be either ‘adjacency’ or ‘connectivity’.

  • alpha (float, optional (default: 0.33)) – The weight or fraction of the smoothed expression to use in the final expression matrix. The final expression matrix is computed as (1 - alpha) * X + alpha * (S @ X), where X is the original expression matrix and S is the smoothing matrix.

  • n_chunks (int, optional (default: None)) – The number of chunks to split the cells into for processing. If not provided, chunk_size is used.

  • chunk_size (int, optional (default: None)) – The size of each chunk of cells to process. If not provided, n_chunks is used.

  • use_raw (bool, optional (default: False)) – Whether to use the raw data stored in adata.raw.X.

  • disable_pbar (bool, optional (default: False)) – Whether to disable the progress bar.

Returns:

The smoothed expression matrix is stored in adata.layers[key_added].

Return type:

None

Notes

This function smooths the expression values of single cells based on their K-nearest neighbors (KNNs) using the Scanpy package. The smoothing is performed by calculating a smoothing matrix S based on the nearest neighbor graph and then computing the smoothed expression as (1 - alpha) * X + alpha * (S @ X), where X is the original expression matrix.

The smoothing is performed in chunks to reduce memory usage. The number of chunks or the chunk size can be specified using the n_chunks or chunk_size parameters, respectively.

The smoothed expression matrix is stored in adata.layers[key_added].
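
The smoothing formula can be traced on a toy example. The sketch below is not the library's implementation: it builds S by row-normalizing a binary adjacency matrix, which is an assumption about what the ‘adjacency’ mode does, and it ignores chunking. The helper names are hypothetical, and alpha=0.5 is used only to keep the arithmetic easy to follow.

```python
def row_normalize(adjacency):
    """Scale each row so that it sums to 1 (rows of zeros stay zero)."""
    normalized = []
    for row in adjacency:
        total = sum(row)
        normalized.append([v / total if total else 0.0 for v in row])
    return normalized

def matmul(a, b):
    """Plain-Python matrix product a @ b."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def smooth(X, S, alpha=0.33):
    """Compute (1 - alpha) * X + alpha * (S @ X)."""
    SX = matmul(S, X)
    return [[(1 - alpha) * x + alpha * s for x, s in zip(x_row, s_row)]
            for x_row, s_row in zip(X, SX)]

# Three cells, two genes. Cell 0 neighbors cells 1 and 2; they neighbor cell 0.
adjacency = [[0, 1, 1], [1, 0, 0], [1, 0, 0]]
X = [[2.0, 0.0], [4.0, 2.0], [0.0, 4.0]]

S = row_normalize(adjacency)
smoothed = smooth(X, S, alpha=0.5)
print(smoothed)
```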

sccellfie.expression.get_global_mean_threshold(adata, lower_bound=1e-05, upper_bound=None, exclude_zeros=False, use_raw=False)[source]

Obtains the global mean threshold for each gene in an AnnData object.

Parameters:
  • adata (AnnData object) – Annotated data matrix.

  • lower_bound (float or pandas.DataFrame, optional (default: 1e-5)) – Lower bound for the threshold. If a pandas.DataFrame is provided, it must have the same number of genes as the adata object.

  • upper_bound (float or pandas.DataFrame, optional (default: None)) – Upper bound for the threshold. If a pandas.DataFrame is provided, it must have the same number of genes as the adata object.

  • exclude_zeros (bool, optional (default: False)) – Whether to exclude zeros when computing the threshold.

  • use_raw (bool, optional (default: False)) – Whether to use the raw data stored in adata.raw.X.

Returns:

thresholds – A pandas.DataFrame object with the global mean threshold for each gene.

Return type:

pandas.DataFrame
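
The role of lower_bound and upper_bound in the threshold functions of this module can be pictured as a clip applied to each gene's raw threshold. This reading is an assumption based on the parameter descriptions, and the helper apply_bounds and the gene names below are purely illustrative.

```python
def apply_bounds(threshold, lower_bound=1e-5, upper_bound=None):
    """Hypothetical helper: keep a threshold within [lower_bound, upper_bound].
    A bound set to None is not enforced."""
    if lower_bound is not None:
        threshold = max(threshold, lower_bound)
    if upper_bound is not None:
        threshold = min(threshold, upper_bound)
    return threshold

# Illustrative raw thresholds per gene (made-up names and values).
raw = {"GENE_A": 0.0, "GENE_B": 0.8, "GENE_C": 12.0}
bounded = {gene: apply_bounds(t, upper_bound=5.0) for gene, t in raw.items()}
print(bounded)
```

Under this reading, the default lower bound of 1e-5 mainly prevents a threshold of exactly zero.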

sccellfie.expression.get_global_trimean_threshold(adata, lower_bound=1e-05, upper_bound=None, exclude_zeros=False, use_raw=False)[source]

Obtains the global Tukey’s trimean threshold for each gene in an AnnData object.

Parameters:
  • adata (AnnData object) – Annotated data matrix.

  • lower_bound (float or pandas.DataFrame, optional (default: 1e-5)) – Lower bound for the threshold. If a pandas.DataFrame is provided, it must have the same number of genes as the adata object.

  • upper_bound (float or pandas.DataFrame, optional (default: None)) – Upper bound for the threshold. If a pandas.DataFrame is provided, it must have the same number of genes as the adata object.

  • exclude_zeros (bool, optional (default: False)) – Whether to exclude zeros when computing the threshold.

  • use_raw (bool, optional (default: False)) – Whether to use the raw data stored in adata.raw.X.

Returns:

thresholds – A pandas.DataFrame object with the global Tukey’s trimean threshold for each gene.

Return type:

pandas.DataFrame

sccellfie.expression.get_local_mean_threshold(adata, lower_bound=1e-05, upper_bound=None, exclude_zeros=False, use_raw=False)[source]

Obtains the local mean threshold for each gene in an AnnData object.

Parameters:
  • adata (AnnData object) – Annotated data matrix.

  • lower_bound (float or pandas.DataFrame, optional (default: 1e-5)) – Lower bound for the threshold. If a pandas.DataFrame is provided, it must have the same number of genes as the adata object.

  • upper_bound (float or pandas.DataFrame, optional (default: None)) – Upper bound for the threshold. If a pandas.DataFrame is provided, it must have the same number of genes as the adata object.

  • exclude_zeros (bool, optional (default: False)) – Whether to exclude zeros when computing the threshold.

  • use_raw (bool, optional (default: False)) – Whether to use the raw data stored in adata.raw.X.

Returns:

thresholds – A pandas.DataFrame object with the local mean threshold for each gene.

Return type:

pandas.DataFrame

sccellfie.expression.get_global_percentile_threshold(adata, percentile=0.75, lower_bound=1e-05, upper_bound=None, exclude_zeros=False, use_raw=False)[source]

Obtains the global percentile threshold for each gene in an AnnData object.

Parameters:
  • adata (AnnData object) – Annotated data matrix.

  • percentile (float or list of floats, optional (default: 0.75)) – Percentile(s) to compute the threshold.

  • lower_bound (float or pandas.DataFrame, optional (default: 1e-5)) – Lower bound for the threshold. If a pandas.DataFrame is provided, it must have the same number of genes as the adata object.

  • upper_bound (float or pandas.DataFrame, optional (default: None)) – Upper bound for the threshold. If a pandas.DataFrame is provided, it must have the same number of genes as the adata object.

  • exclude_zeros (bool, optional (default: False)) – Whether to exclude zeros when computing the threshold.

  • use_raw (bool, optional (default: False)) – Whether to use the raw data stored in adata.raw.X.

Returns:

thresholds – A pandas.DataFrame object with the global percentile threshold for each gene.

Return type:

pandas.DataFrame

sccellfie.expression.get_local_percentile_threshold(adata, percentile=0.75, lower_bound=1e-05, upper_bound=None, exclude_zeros=False, use_raw=False)[source]

Obtains the local percentile threshold for each gene in an AnnData object.

Parameters:
  • adata (AnnData object) – Annotated data matrix.

  • percentile (float or list of floats, optional (default: 0.75)) – Percentile(s) to compute the threshold.

  • lower_bound (float or pandas.DataFrame, optional (default: 1e-5)) – Lower bound for the threshold. If a pandas.DataFrame is provided, it must have the same number of genes as the adata object.

  • upper_bound (float or pandas.DataFrame, optional (default: None)) – Upper bound for the threshold. If a pandas.DataFrame is provided, it must have the same number of genes as the adata object.

  • exclude_zeros (bool, optional (default: False)) – Whether to exclude zeros when computing the threshold.

  • use_raw (bool, optional (default: False)) – Whether to use the raw data stored in adata.raw.X.

Returns:

thresholds – A pandas.DataFrame object with the local percentile threshold for each gene.

Return type:

pandas.DataFrame

sccellfie.expression.get_local_trimean_threshold(adata, lower_bound=1e-05, upper_bound=None, exclude_zeros=False, use_raw=False)[source]

Obtains the local Tukey’s trimean threshold for each gene in an AnnData object.

Parameters:
  • adata (AnnData object) – Annotated data matrix.

  • lower_bound (float or pandas.DataFrame, optional (default: 1e-5)) – Lower bound for the threshold. If a pandas.DataFrame is provided, it must have the same number of genes as the adata object.

  • upper_bound (float or pandas.DataFrame, optional (default: None)) – Upper bound for the threshold. If a pandas.DataFrame is provided, it must have the same number of genes as the adata object.

  • exclude_zeros (bool, optional (default: False)) – Whether to exclude zeros when computing the threshold.

  • use_raw (bool, optional (default: False)) – Whether to use the raw data stored in adata.raw.X.

Returns:

thresholds – A pandas.DataFrame object with the local Tukey’s trimean threshold for each gene.

Return type:

pandas.DataFrame

sccellfie.expression.get_sccellfie_dataset_threshold(adata, gene_set=None, organism='human', cell_mask=None, layer=None, use_raw=False, target_sum=10000, n_counts_key=None, chunk_size=100000, reservoir_size=5000000, percentiles=(10, 25, 50, 75, 90, 95), lower_percentile=25, upper_percentile=75, random_state=None, verbose=True, return_stats=False)[source]

Computes a dataset-wise sccellfie_threshold per metabolic gene by streaming the AnnData in chunks. Faithful port of the atlas-based threshold script that produced the default Thresholds.csv, generalized to a single (possibly backed) AnnData.

Pipeline per chunk:
  1. CP10k-normalize using a per-cell library size (obs column or computed from the full chunk).

  2. Subset to the corrected metabolic-gene columns (after applying CORRECT_GENES[organism]).

  3. Accumulate per-gene sum, non-zero cell count, and max.

  4. Stream non-zero normalized values into a reservoir sample for global percentiles.
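
The per-chunk steps above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the library's code: the names Reservoir and stream_stats are hypothetical, rows are plain lists instead of sparse matrices, the gene subset is given as column indices (the CORRECT_GENES renaming is skipped), and the reservoir uses the classic Algorithm R, which may differ from the sampler actually used.

```python
import random

class Reservoir:
    """Fixed-size uniform sample of a stream (Algorithm R)."""
    def __init__(self, size, seed=None):
        self.size = size
        self.values = []
        self.seen = 0
        self.rng = random.Random(seed)

    def add(self, value):
        self.seen += 1
        if len(self.values) < self.size:
            self.values.append(value)
        else:
            # Replace a stored value with probability size / seen.
            j = self.rng.randrange(self.seen)
            if j < self.size:
                self.values[j] = value

def stream_stats(cells, gene_idx, chunk_size, target_sum=10_000,
                 reservoir_size=1000, seed=None):
    """Accumulate per-gene sum, non-zero count, and max over chunks of cells."""
    n = len(gene_idx)
    sums, nnz, maxs = [0.0] * n, [0] * n, [0.0] * n
    reservoir = Reservoir(reservoir_size, seed)
    for start in range(0, len(cells), chunk_size):
        for row in cells[start:start + chunk_size]:
            total = sum(row)  # library size from the full row, before subsetting
            scale = target_sum / total if total else 0.0
            for k, g in enumerate(gene_idx):
                value = row[g] * scale
                sums[k] += value
                maxs[k] = max(maxs[k], value)
                if value > 0:
                    nnz[k] += 1
                    reservoir.add(value)  # only non-zero values are sampled
    return sums, nnz, maxs, reservoir

# Four cells, three genes; columns 0 and 2 play the role of metabolic genes.
cells = [[1, 1, 2], [0, 5, 5], [2, 0, 0], [0, 0, 0]]
sums, nnz, maxs, reservoir = stream_stats(
    cells, gene_idx=[0, 2], chunk_size=2, target_sum=100, reservoir_size=10)
nonzero_mean = [s / c if c else 0.0 for s, c in zip(sums, nnz)]
print(sums, nnz, maxs, nonzero_mean, sorted(reservoir.values))
```

Because the reservoir here is larger than the number of non-zero values, it simply retains all of them; with a longer stream it would hold a uniform random subset from which the global percentiles are estimated.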

The final threshold rule matches the original script (with configurable bounds):

    if max > P_lower or max == 0:
        threshold = clip(nonzero_mean, P_lower, P_upper)
    else:
        threshold = nonzero_mean

where P_lower / P_upper default to P25 / P75 (the original atlas behavior) and are controlled by lower_percentile / upper_percentile.
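
The rule can be written as a small function and checked against a few cases. The function name threshold_rule is hypothetical, and the percentile values in the example are made up for illustration.

```python
def threshold_rule(nonzero_mean, gene_max, p_lower, p_upper):
    """Apply the clip rule described above to a single gene."""
    if gene_max > p_lower or gene_max == 0:
        # clip(nonzero_mean, P_lower, P_upper)
        return min(max(nonzero_mean, p_lower), p_upper)
    return nonzero_mean

p_lower, p_upper = 1.0, 4.0  # illustrative P25 / P75 values

cases = {
    "high mean, clipped down": threshold_rule(6.0, 20.0, p_lower, p_upper),
    "low mean, clipped up": threshold_rule(0.5, 3.0, p_lower, p_upper),
    "max not above P_lower": threshold_rule(0.3, 0.8, p_lower, p_upper),
    "never detected": threshold_rule(0.0, 0.0, p_lower, p_upper),
}
print(cases)
```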

Parameters:
  • adata (AnnData) – Annotated data matrix. May be backed (sc.read_h5ad(..., backed='r')); chunks are materialized one at a time.

  • gene_set (list, set, pandas.Index, str or None, optional (default: None)) – Metabolic gene list. None loads the default gene list from the scCellFie database for organism. A string ending in .json is treated as the path to a JSON file containing a list of gene symbols.

  • organism (str, optional (default: 'human')) – Used to select the CORRECT_GENES rename map and, if gene_set is None, the scCellFie database to load metabolic genes from. Currently 'human' or 'mouse'.

  • cell_mask (array-like, str or None, optional (default: None)) – Restricts the computation to a subset of cells. Accepts a boolean/integer array, a column name in adata.obs, or a pandas.Series indexed by cell names.

  • layer (str or None, optional (default: None)) – Read from adata.layers[layer] instead of adata.X. Mutually exclusive with use_raw.

  • use_raw (bool, optional (default: False)) – Read from adata.raw.X. Mutually exclusive with layer.

  • target_sum (float or None, optional (default: 10_000)) – Target library size for CP-normalization. Pass None to skip normalization (e.g. when the input values are already on the desired scale).

  • n_counts_key (str or None, optional (default: None)) – Column in adata.obs containing per-cell totals. If None, auto-detect among ('total_counts', 'n_counts', 'raw_sum', 'nCount_RNA') and otherwise compute per-cell sums from the full-matrix chunk before gene subsetting.

  • chunk_size (int, optional (default: 100_000)) – Number of cells processed per chunk.

  • reservoir_size (int, optional (default: 5_000_000)) – Size of the reservoir used to estimate global percentiles of non-zero normalized values. Memory cost is reservoir_size * 4B (float32).

  • percentiles (tuple of int, optional (default: (10, 25, 50, 75, 90, 95))) – Percentiles to report in the returned stats. Always merged with {lower_percentile, upper_percentile} so the rule’s bounds are also available for inspection.

  • lower_percentile (int or float, optional (default: 25)) – Lower percentile bound (P_lower) used by the clip rule. The threshold for each gene is clip(nonzero_mean, P_lower, P_upper) when the gene’s max value exceeds P_lower or is zero (the low-expression escape); otherwise the raw nonzero_mean is used. Must satisfy 0 <= lower_percentile < upper_percentile <= 100. The defaults (25 and 75) reproduce the original atlas-derived sccellfie_threshold exactly.

  • upper_percentile (int or float, optional (default: 75)) – Upper percentile bound (P_upper) used by the clip rule. The rule and constraints are the same as described for lower_percentile.

  • random_state (int or None, optional (default: None)) – Seed for the reservoir sampler.

  • verbose (bool, optional (default: True)) – If True, print progress via tqdm.

  • return_stats (bool, optional (default: False)) – If True, also return a dict with intermediate statistics.

Returns:

  • thresholds (pandas.DataFrame) – A DataFrame indexed by metabolic gene symbol with a single column 'sccellfie_threshold'. Ready to pass to compute_gene_scores (which selects the first column positionally).

  • stats (dict, only if return_stats=True) – Dict with keys percentiles, sum_per_gene, nnz_per_gene, max_per_gene, mean, nonzero_mean, n_cells, n_values_seen, reservoir_size_used.

sccellfie.expression.set_manual_threshold(adata, threshold)[source]

Sets a threshold manually for each gene in an AnnData object.

Parameters:
  • adata (AnnData object) – Annotated data matrix.

  • threshold (float or list of floats) – Threshold(s) to be set for each gene. If a list is passed it must have the same number of elements as genes in adata, and in the same order.

Returns:

thresholds – A pandas.DataFrame object with the manual threshold for each gene.

Return type:

pandas.DataFrame

External

sccellfie.external.sccellfie_to_tensor(preprocessed_db, sample_key, celltype_key, score_type='metabolic_tasks', min_cells_per_group=1, agg_func='trimean', layer=None, gene_symbols=None, top_percent=10, exclude_zeros=False, use_raw=False, threshold=None, order_labels=None, sort_elements=True, context_order=None, fill_value=nan, verbose=True)[source]

Converts scCellFie scores to format compatible with cell2cell’s PreBuiltTensor constructor.

This function builds a 3D tensor with dimensions: [Contexts/Samples, Cell Types, Metabolic Features]

Parameters:
  • preprocessed_db (dict) – Output from run_sccellfie_pipeline containing ‘adata’ with metabolic_tasks and/or reactions attributes.

  • sample_key (str) – Column name in adata.obs for grouping by samples/contexts.

  • celltype_key (str) – Column name in adata.obs for cell type annotations.

  • score_type (str, optional (default: 'metabolic_tasks')) – Which scCellFie scores to use. Options: ‘metabolic_tasks’, ‘reactions’.

  • min_cells_per_group (int, optional (default: 1)) – Minimum number of cells required per group (sample x celltype) to be included in analysis.

  • agg_func (str, optional (default: 'trimean')) – Aggregation function to apply within cell groups. Options: ‘mean’, ‘median’, ‘25p’, ‘75p’, ‘trimean’, ‘topmean’, ‘fraction_above’.

  • layer (str, optional (default: None)) – Layer name to use for aggregation. If None, uses the main .X matrix.

  • gene_symbols (str or list, optional (default: None)) – Specific features to include in analysis. If None, all features are used.

  • top_percent (float, optional (default: 10)) – Percentage of top values for ‘topmean’ aggregation (0-100).

  • exclude_zeros (bool, optional (default: False)) – Whether to exclude zeros when aggregating values.

  • use_raw (bool, optional (default: False)) – Whether to use raw data for aggregation.

  • threshold (float, optional (default: None)) – Expression threshold for ‘fraction_above’ aggregation.

  • order_labels (list, optional (default: None)) – Labels for each dimension of the tensor. Default: [‘Contexts’, ‘Cell Types’, ‘Metabolic Features’]

  • sort_elements (bool, optional (default: True)) – Whether to alphabetically sort elements in each dimension.

  • context_order (list, optional (default: None)) – Custom order for contexts. If provided, contexts won’t be sorted.

  • fill_value (float, optional (default: numpy.nan)) – Value to fill when a feature or cell type is missing in a context.

  • verbose (bool, optional (default: True)) – Whether to print information about the analysis.

Returns:

prebuilt_tensor_args – A dictionary containing all arguments needed for the PreBuiltTensor constructor. Keys are:

  • ‘tensor’: numpy array with shape (n_contexts, n_celltypes, n_features)

  • ‘order_names’: list of lists with names for each dimension

  • ‘order_labels’: list of dimension labels

  • ‘mask’: mask for missing values (if applicable)

  • ‘loc_nans’: locations of NaN values

Return type:

dict

Notes

This function aggregates single-cell metabolic scores into cell type-level summaries across different contexts (samples, conditions, timepoints, etc.) and creates a tensor suitable for tensor decomposition analysis.

The aggregation is performed using scCellFie’s robust aggregation methods, which handle various statistical measures and can exclude zeros or use specific thresholds.
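
The layout of the resulting tensor, including how fill_value covers missing sample and cell type combinations, can be sketched with nested lists. The helper build_tensor is hypothetical and starts from already aggregated scores; it is meant to show the [Contexts, Cell Types, Metabolic Features] arrangement, not to reproduce the function.

```python
import math

def build_tensor(group_scores, n_features, fill_value=math.nan):
    """Arrange aggregated scores as [contexts][cell types][features].

    group_scores maps (context, cell_type) to a list of feature scores.
    Missing combinations are filled with fill_value."""
    contexts = sorted({context for context, _ in group_scores})
    celltypes = sorted({celltype for _, celltype in group_scores})
    tensor = [
        [list(group_scores.get((c, t), [fill_value] * n_features))
         for t in celltypes]
        for c in contexts
    ]
    return tensor, [contexts, celltypes]

# Two features per group; B cells are absent from the 'treated' sample.
group_scores = {
    ("ctrl", "B"): [1.0, 2.0],
    ("ctrl", "T"): [3.0, 4.0],
    ("treated", "T"): [5.0, 6.0],
}
tensor, names = build_tensor(group_scores, n_features=2)
print(names)
print(tensor)
```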

Examples

>>> # Convert scCellFie metabolic tasks to tensor format
>>> tensor_args = sccellfie_to_tensor(
...     preprocessed_db,
...     sample_key='condition',
...     celltype_key='cell_type',
...     score_type='metabolic_tasks',
...     agg_func='mean'
... )
>>>
>>> # Create PreBuiltTensor
>>> from cell2cell.tensor import PreBuiltTensor
>>> tensor = PreBuiltTensor(**tensor_args)

sccellfie.external.quick_markers(adata, cluster_key, cell_groups=None, layer=None, n_markers=10, fdr=0.01, express_cut=0.9, r_output=False)[source]

Identifies top N markers for each cluster in an AnnData object using a TF-IDF-based strategy. Implemented as in the SoupX library for R.

Parameters:
  • adata (AnnData) – Annotated data matrix from Scanpy.

  • cluster_key (str) – Key in adata.obs for the cluster labels.

  • cell_groups (list, optional (default: None)) – List of cell groups to be compared in the analysis.

  • layer (str, optional (default: None)) – Layer to use for the analysis. If None, uses adata.X.

  • n_markers (int, optional (default: 10)) – Number of marker genes to return per cluster.

  • fdr (float, optional (default: 0.01)) – False discovery rate for the hypergeometric test.

  • express_cut (float, optional (default: 0.9)) – Value above which a gene is considered expressed.

  • r_output (bool, optional (default: False)) – Whether to report exactly the same column names as the SoupX version.

Returns:

markers – A pandas.DataFrame with top N markers for each cluster and their statistics.

Return type:

pandas.DataFrame
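
For intuition, the TF-IDF score of a single gene can be sketched as below. The definitions used here are an assumption based on the general TF-IDF marker strategy: TF is the fraction of cells in a cluster expressing the gene above express_cut, and IDF is the log of the total number of cells divided by the number of cells expressing the gene anywhere. The helper tfidf_per_cluster is hypothetical, and the hypergeometric test and FDR filtering are left out.

```python
import math

def tfidf_per_cluster(expr, clusters, express_cut=0.9):
    """Hypothetical helper: TF, IDF, and TF-IDF of one gene for each cluster."""
    expressed = [value > express_cut for value in expr]
    n_expressing = sum(expressed)
    # Assumption: a gene expressed nowhere gets an IDF of 0.
    idf = math.log(len(expr) / n_expressing) if n_expressing else 0.0
    result = {}
    for cluster in sorted(set(clusters)):
        in_cluster = [e for e, c in zip(expressed, clusters) if c == cluster]
        tf = sum(in_cluster) / len(in_cluster)
        result[cluster] = {"tf": tf, "idf": idf, "tf_idf": tf * idf}
    return result

# One gene measured in six cells, three per cluster.
expr = [2.0, 3.0, 1.5, 0.0, 0.5, 0.0]
clusters = ["A", "A", "A", "B", "B", "B"]
result = tfidf_per_cluster(expr, clusters)
print(result)
```

A gene expressed in every cell of one cluster and nowhere else gets the highest possible TF for that cluster and a positive IDF, which is what makes it rank as a marker.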

sccellfie.external.filter_tfidf_markers(df, tf_col='tf', idf_col='idf', tfidf_threshold=None, tfidf_col='tf_idf', tf_ratio=None, second_best_tf_col='second_best_tf', group_col='cluster', second_best_group_col='second_best_cluster')[source]

Filters the top N markers for each cluster based on a hyperbolic curve fit to the TF-IDF values. Additional filtering can be applied based on the TF-IDF threshold and the ratio of the TF score to the second-best TF score.

Parameters:
  • df (pandas.DataFrame) – DataFrame containing the marker data. See sccellfie.external.quick_markers for details.

  • tf_col (str, optional (default: 'tf')) – Column name for the Term Frequency (TF) values.

  • idf_col (str, optional (default: 'idf')) – Column name for the Inverse Document Frequency (IDF) values.

  • tfidf_threshold (float, optional (default: None)) – Threshold for the TF-IDF values. If provided, only markers with TF-IDF values above this threshold are kept. A value of 0.3 is recommended for most datasets.

  • tfidf_col (str, optional (default: 'tf_idf')) – Column name for the TF-IDF values. Used for filtering based on the TF-IDF threshold.

  • tf_ratio (float, optional (default: None)) – Threshold for the ratio of the TF score to the second-best TF score. If provided, only markers with a ratio above this threshold are kept. A value of 1.2 is recommended for most datasets.

  • second_best_tf_col (str, optional (default: 'second_best_tf')) – Column name for the second-best TF values. Used for filtering based on the TF ratio.

  • group_col (str, optional (default: 'cluster')) – Column name for the cluster labels. Used for filtering based on the TF ratio. This is to keep markers when the cluster equals the second-best cluster (very specific marker).

  • second_best_group_col (str, optional (default: 'second_best_cluster')) – Column name for the second-best cluster labels. Used for filtering based on the TF ratio. This is to keep markers when the cluster equals the second-best cluster (very specific marker).

Returns:

  • filtered_df (pandas.DataFrame) – DataFrame containing the filtered markers.

  • theoretical_curve (tuple) – Tuple containing the x and y values of the theoretical hyperbolic curve.
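
The two optional filters can be pictured with the sketch below, which works on a list of dictionaries using the default column names. It is a simplified reading of the parameter descriptions, not the function itself: the hyperbolic curve fit is omitted, the helper filter_markers is hypothetical, and treating a second-best TF of zero as an infinite ratio is an assumption.

```python
def filter_markers(rows, tfidf_threshold=None, tf_ratio=None):
    """Hypothetical helper applying the TF-IDF and TF-ratio filters."""
    kept = []
    for row in rows:
        if tfidf_threshold is not None and row["tf_idf"] <= tfidf_threshold:
            continue
        if tf_ratio is not None:
            # Markers whose second-best cluster is their own cluster are kept.
            same_cluster = row["cluster"] == row["second_best_cluster"]
            second = row["second_best_tf"]
            ratio = row["tf"] / second if second > 0 else float("inf")
            if not same_cluster and ratio <= tf_ratio:
                continue
        kept.append(row)
    return kept

rows = [
    {"gene": "A", "cluster": "0", "tf": 0.9, "tf_idf": 0.6,
     "second_best_tf": 0.3, "second_best_cluster": "1"},
    {"gene": "B", "cluster": "0", "tf": 0.8, "tf_idf": 0.1,
     "second_best_tf": 0.2, "second_best_cluster": "1"},
    {"gene": "C", "cluster": "1", "tf": 0.5, "tf_idf": 0.4,
     "second_best_tf": 0.45, "second_best_cluster": "0"},
    {"gene": "D", "cluster": "2", "tf": 0.5, "tf_idf": 0.5,
     "second_best_tf": 0.5, "second_best_cluster": "2"},
]
kept = filter_markers(rows, tfidf_threshold=0.3, tf_ratio=1.2)
kept_genes = [row["gene"] for row in kept]
print(kept_genes)
```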

sccellfie.external.markers_to_dict(markers_df, n_markers=10, sort_by='tf_idf', cluster_col='cluster', gene_col='gene', ascending=False)[source]

Converts a markers DataFrame to a dictionary mapping cluster names to lists of marker genes.

Parameters:
  • markers_df (pandas.DataFrame) – DataFrame containing marker data with cluster and gene information.

  • n_markers (int, optional (default: 10)) – Number of top markers to select per cluster.

  • sort_by (str, optional (default: 'tf_idf')) – Column name to sort markers by for each cluster.

  • cluster_col (str, optional (default: 'cluster')) – Column name containing cluster labels.

  • gene_col (str, optional (default: 'gene')) – Column name containing gene names.

  • ascending (bool, optional (default: False)) – Whether to sort in ascending order. Default is False (descending order for TF-IDF).

Returns:

markers_dict – Dictionary mapping cluster names to lists of marker gene names. Keys are naturally sorted cluster names.

Return type:

dict
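
As a rough picture of the conversion, the sketch below groups rows by cluster, ranks them by the sort column, keeps the top n genes, and orders the keys naturally (so "2" comes before "10"). It uses plain dicts in place of a DataFrame; top_markers and natural_key are hypothetical helpers, not part of scCellFie.

```python
import re
from collections import defaultdict


def natural_key(label):
    """Split digits from text so that '2' sorts before '10'."""
    return [int(tok) if tok.isdigit() else tok for tok in re.split(r"(\d+)", str(label))]


def top_markers(rows, n_markers=10, sort_by="tf_idf", ascending=False):
    """Group rows by cluster and keep the top-n genes per cluster (illustrative only)."""
    by_cluster = defaultdict(list)
    for row in rows:
        by_cluster[row["cluster"]].append(row)
    result = {}
    for cluster in sorted(by_cluster, key=natural_key):
        ranked = sorted(by_cluster[cluster], key=lambda r: r[sort_by], reverse=not ascending)
        result[cluster] = [r["gene"] for r in ranked[:n_markers]]
    return result


# Made-up rows with the default column names.
rows = [
    {"cluster": "10", "gene": "G1", "tf_idf": 0.9},
    {"cluster": "2", "gene": "G2", "tf_idf": 0.4},
    {"cluster": "2", "gene": "G3", "tf_idf": 0.7},
    {"cluster": "10", "gene": "G4", "tf_idf": 0.5},
]
markers_dict = top_markers(rows, n_markers=1)
```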

IO

sccellfie.io.load_adata(folder, filename, reactions_filename=None, metabolic_tasks_filename=None, spatial_network_key='spatial_network', verbose=True)[source]

Loads an AnnData object and its scCellFie attributes from a folder.

Parameters:
  • folder (str) – The folder from which to load the AnnData object.

  • filename (str) – The name of the file to load the AnnData object.

  • reactions_filename (str, optional (default: None)) – The name of the file (without extension) to load the reactions object. If None, the default name is filename_reactions.

  • metabolic_tasks_filename (str, optional (default: None)) – The name of the file (without extension) to load the metabolic_tasks object. If None, the default name is filename_metabolic_tasks.

  • spatial_network_key (str, optional (default: 'spatial_network')) – The key in adata.uns or a scCellFie_attribute.uns where the spatial knn graph is stored, if it exists.

  • verbose (bool, optional (default: True)) – Whether to print the file names that were loaded.

Returns:

adata – Annotated data matrix. If scCellFie attributes are found, they are also loaded into adata.reactions and adata.metabolic_tasks.

Return type:

AnnData object

sccellfie.io.save_adata(adata, output_directory, filename, spatial_network_key='spatial_network', verbose=True)[source]

Saves an AnnData object and its scCellFie attributes to a folder.

Parameters:
  • adata (AnnData object) – Annotated data matrix.

  • output_directory (str) – Directory to save the results (AnnData objects).

  • filename (str) – The name of the file to save the AnnData object. Do not include the file extension.

  • spatial_network_key (str, optional (default: 'spatial_network')) – The key in adata.uns or a scCellFie_attribute.uns where the spatial knn graph is stored.

  • verbose (bool, optional (default: True)) – Whether to print the file names that were saved.

Returns:

The AnnData object is saved to output_directory/filename.h5ad. The scCellFie attributes are saved to:

  • reactions: output_directory/filename_reactions.h5ad.

  • metabolic_tasks: output_directory/filename_metabolic_tasks.h5ad.

Return type:

None
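
The file-naming convention shared by save_adata and load_adata can be summarized with a short sketch. The helper sccellfie_paths is hypothetical and only builds the paths described above; it does not read or write anything.

```python
from pathlib import Path


def sccellfie_paths(output_directory, filename):
    """Build the three file paths described above (hypothetical helper)."""
    base = Path(output_directory)
    return {
        "adata": base / f"{filename}.h5ad",
        "reactions": base / f"{filename}_reactions.h5ad",
        "metabolic_tasks": base / f"{filename}_metabolic_tasks.h5ad",
    }


# "results" and "sample1" are placeholder names.
paths = sccellfie_paths("results", "sample1")
```

Because filename is given without an extension, the same value passed to load_adata lets the default reactions and metabolic_tasks names line up with what save_adata wrote.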

sccellfie.io.save_result_summary(results_dict, output_directory, prefix='')[source]

Save the result summary contained in a dictionary to CSV files.

Parameters:
  • results_dict (dict) – Dictionary containing the DataFrames with results from the sccellfie.reports.summary.generate_report_from_adata() function.

  • output_directory (str) – Directory to save the results.

  • prefix (str, optional (default: '')) – Prefix to add to the filenames.

sccellfie.io.load_segmentation(filepath: str, cell_ids: ndarray | None = None, cell_id_col: str | None = None, vertex_x_col: str | None = None, vertex_y_col: str | None = None, output: str = 'geodataframe') gpd.GeoDataFrame | dict[source]

Load cell boundary polygons from a segmentation file.

Generic loader for any vertex-table format (one row per polygon vertex). Supports Xenium parquet, CSV.gz, CSV, TSV, and TSV.gz with auto-detection of column names.

Parameters:
  • filepath (str) – Path to the cell boundaries file. Accepted extensions are .parquet, .csv.gz, .csv, .tsv, and .tsv.gz.

  • cell_ids (np.ndarray, optional (default: None)) – If provided, only load polygons for these cell IDs.

  • cell_id_col (str, optional (default: None)) – Column name for cell identifiers. Auto-detected if None. Tries "cell_id", "ID", "id", "cell_ID" in that order.

  • vertex_x_col (str, optional (default: None)) – Column name for vertex x-coordinates. Auto-detected if None. Tries "vertex_x", "x_location", "X".

  • vertex_y_col (str, optional (default: None)) – Column name for vertex y-coordinates. Auto-detected if None. Tries "vertex_y", "y_location", "Y".

  • output ({"geodataframe", "dict"}, optional (default: "geodataframe")) – Return format. "geodataframe" returns a GeoDataFrame indexed by cell ID with centroid_x / centroid_y columns. "dict" returns a mapping of cell_id -> shapely.Polygon.

Returns:

Cell boundary polygons in the requested format.

Return type:

geopandas.GeoDataFrame or dict
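
To make the vertex-table format and the column auto-detection order concrete, here is a standard-library sketch. It is an illustration under assumptions, not the loader itself: it only handles CSV text, returns raw coordinate lists instead of shapely polygons, and raising ValueError when no candidate column is found is an assumption. detect_column and vertices_by_cell are hypothetical names.

```python
import csv
import io
from collections import defaultdict

# Candidate column names, in the order listed in the parameter descriptions.
CELL_ID_CANDIDATES = ("cell_id", "ID", "id", "cell_ID")
X_CANDIDATES = ("vertex_x", "x_location", "X")
Y_CANDIDATES = ("vertex_y", "y_location", "Y")


def detect_column(columns, candidates):
    """Return the first candidate present in columns, trying them in order."""
    for name in candidates:
        if name in columns:
            return name
    raise ValueError(f"None of {candidates} found in {list(columns)}")


def vertices_by_cell(handle, cell_ids=None):
    """Group (x, y) vertices per cell from a vertex table (one row per vertex)."""
    reader = csv.DictReader(handle)
    id_col = detect_column(reader.fieldnames, CELL_ID_CANDIDATES)
    x_col = detect_column(reader.fieldnames, X_CANDIDATES)
    y_col = detect_column(reader.fieldnames, Y_CANDIDATES)
    wanted = None if cell_ids is None else set(cell_ids)
    polygons = defaultdict(list)
    for row in reader:
        if wanted is None or row[id_col] in wanted:
            polygons[row[id_col]].append((float(row[x_col]), float(row[y_col])))
    return dict(polygons)


# A tiny made-up table with two triangular cells.
table = io.StringIO(
    "cell_id,vertex_x,vertex_y\n"
    "a,0,0\na,1,0\na,1,1\n"
    "b,2,2\nb,3,2\nb,3,3\n"
)
polygons = vertices_by_cell(table, cell_ids=["a"])
```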

sccellfie.io.load_xenium_segmentation(filepath: str, cell_ids: ndarray | None = None, cell_id_col: str | None = None, vertex_x_col: str | None = None, vertex_y_col: str | None = None, output: str = 'geodataframe') gpd.GeoDataFrame | dict[source]

Load cell boundaries from a Xenium cell_boundaries file.

Thin wrapper around load_segmentation() kept for discoverability. Xenium cell_boundaries files use the default auto-detected columns (cell_id, vertex_x, vertex_y) so this is equivalent to calling load_segmentation() directly.

See load_segmentation() for parameter and return documentation.

sccellfie.io.load_segmentation_from_gdf(gdf, geometry_col: str = 'geometry')[source]

Prepare a pre-loaded GeoDataFrame for downstream plotting.

Adds centroid_x and centroid_y columns if missing.

Parameters:
  • gdf (geopandas.GeoDataFrame) – GeoDataFrame with polygon geometries.

  • geometry_col (str, optional (default: "geometry")) – Name of the geometry column.

Returns:

Input GeoDataFrame with centroid columns added.

Return type:

geopandas.GeoDataFrame

sccellfie.io.read_xenium(data_dir: str | Path, slide_id: str | None = None, segmentation: str = 'cell', cluster_file: str | Path | bool | None = None, spatial_key: str = 'X_spatial', verbose: bool = True) AnnData[source]

Read a 10x Xenium output bundle into an AnnData.

Parameters:
  • data_dir (str or Path) – Path to the Xenium bundle root, or to a directory containing one sub-directory per slide (in which case slide_id selects the slide).

  • slide_id (str, optional (default: None)) – Sub-directory under data_dir. When None, data_dir itself is treated as the bundle root.

  • segmentation ({"cell", "nucleus"}, optional (default: "cell")) – "cell" reads cell_feature_matrix.h5 and joins centroids from cells.csv.gz. "nucleus" reads nucleus_feature_matrix.h5ad and pulls centroids from its obs columns x_centroid / y_centroid.

  • cluster_file (str, Path, False, or None, optional (default: None)) – Path to a cluster-assignment CSV (with columns Barcode, Cluster). When None, analysis/clustering/gene_expression_graphclust/clusters.csv is auto-loaded if present. Pass False to skip the lookup.

  • spatial_key (str, optional (default: "X_spatial")) – Key under which centroids are stored in adata.obsm. Defaults to scCellFie’s canonical key; pass "spatial" if you also want scanpy.pl.spatial to find them.

  • verbose (bool, optional (default: True)) – Print informational messages.

Returns:

AnnData with centroid coordinates in adata.obsm[spatial_key] and any cluster assignments in adata.obs['cluster'].

Return type:

anndata.AnnData
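
The way data_dir, slide_id, and cluster_file interact can be sketched as follows. This mirrors only the behavior described in the parameter list; resolve_bundle and resolve_cluster_file are hypothetical helpers, and the file written in the demo is a made-up placeholder.

```python
from pathlib import Path
import tempfile

# Default location named in the cluster_file description, relative to the bundle root.
DEFAULT_CLUSTERS = Path("analysis/clustering/gene_expression_graphclust/clusters.csv")


def resolve_bundle(data_dir, slide_id=None):
    """Return the bundle root: data_dir itself, or data_dir/slide_id."""
    root = Path(data_dir)
    return root if slide_id is None else root / slide_id


def resolve_cluster_file(bundle_root, cluster_file=None):
    """Return the cluster CSV path to read, or None when nothing should be loaded."""
    if cluster_file is False:
        return None  # lookup explicitly disabled
    if cluster_file is None:
        candidate = Path(bundle_root) / DEFAULT_CLUSTERS
        return candidate if candidate.is_file() else None
    return Path(cluster_file)


with tempfile.TemporaryDirectory() as tmp:
    bundle = resolve_bundle(tmp, slide_id="slide_1")
    default = bundle / DEFAULT_CLUSTERS
    default.parent.mkdir(parents=True)
    default.write_text("Barcode,Cluster\naaaa-1,1\n")
    auto = resolve_cluster_file(bundle)            # default file is picked up
    skipped = resolve_cluster_file(bundle, False)  # lookup disabled
    found_default = auto == default
```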

sccellfie.io.read_visium(path: str | Path, *, count_file: str = 'filtered_feature_bc_matrix.h5', library_id: str | None = None, source_image_path: str | Path | None = None, is_hd: bool = False, hd_layout: str = 'detect', genome: str | None = None, load_images: bool = True) AnnData[source]

Read a 10x Visium / VisiumHD bundle into an AnnData.

Standard Visium and VisiumHD-bins layouts delegate to scanpy.read_visium(). The VisiumHD-segmented layout (presence of cell_segmentations.geojson) is handled by a custom branch that derives centroids and per-cell areas from the polygons, optionally merges nucleus areas from nucleus_segmentations.geojson, and writes coordinates under both obsm['spatial'] and obsm['X_spatial'].

Parameters:
  • path (str or Path) – Path to the Visium bundle directory.

  • count_file (str, optional (default: "filtered_feature_bc_matrix.h5")) – Filename of the count matrix inside path.

  • library_id (str, optional (default: None)) – Identifier used as the key under adata.uns['spatial']. When None it is read from the count file’s HDF5 attributes.

  • source_image_path (str or Path, optional (default: None)) – Path to the high-resolution tissue image, recorded under adata.uns['spatial'][library_id]['metadata']['source_image_path'].

  • is_hd (bool, optional (default: False)) – Whether this is a VisiumHD bundle. Used together with hd_layout to dispatch to the right branch.

  • hd_layout ({"detect", "bins", "segmented", "standard"}, optional (default: "detect")) – Force a specific HD layout. "detect" (default) auto-detects: "segmented" if cell_segmentations.geojson is present, otherwise "bins" if spatial/tissue_positions.parquet is present, otherwise "standard".

  • genome (str, optional (default: None)) – Filter expression to genes within this genome (passed through to scanpy.read_visium()).

  • load_images (bool, optional (default: True)) – Whether to load hires/lowres tissue images.

Returns:

AnnData with spatial information stored in standard scanpy format. For the segmented branch, also exposes adata.obsm['X_spatial'] (scCellFie convention).

Return type:

anndata.AnnData
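
The "detect" order for hd_layout can be expressed as a short decision function. This is a sketch of the documented order only; detect_hd_layout is a hypothetical name, and looking for cell_segmentations.geojson directly under the bundle root is an assumption.

```python
from pathlib import Path
import tempfile


def detect_hd_layout(path, hd_layout="detect"):
    """Mirror the documented 'detect' order; explicit layouts are returned unchanged."""
    if hd_layout != "detect":
        return hd_layout
    root = Path(path)
    if (root / "cell_segmentations.geojson").exists():
        return "segmented"
    if (root / "spatial" / "tissue_positions.parquet").exists():
        return "bins"
    return "standard"


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    empty_layout = detect_hd_layout(root)       # nothing present
    (root / "spatial").mkdir()
    (root / "spatial" / "tissue_positions.parquet").touch()
    bins_layout = detect_hd_layout(root)        # only the bins marker present
    (root / "cell_segmentations.geojson").touch()
    segmented_layout = detect_hd_layout(root)   # segmentation takes priority
```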

Plotting

sccellfie.plotting.plot_communication_network(ccc_scores, sender_col, receiver_col, score_col, score_threshold=None, panel_size=(12, 8), network_layout='spring', edge_color='magenta', edge_width=25, edge_arrow_size=20, edge_alpha=0.25, node_color='#210070', node_size=1000, node_alpha=0.9, node_label_size=12, node_label_alpha=0.7, node_label_offset=(0.05, -0.2), title=None, title_fontsize=14, ax=None, save=None, dpi=300, tight_layout=True, bbox_inches='tight')[source]

Plots a network of cell-cell communication. Edges represent communication scores between cells. These scores could be an overall communication score or a specific ligand-receptor pair score.

Parameters:
  • ccc_scores (pandas.DataFrame) – DataFrame containing the cell-cell communication scores. It should contain columns for the sender cell, receiver cell, and the communication score.

  • sender_col (str) – Column name for the sender cell.

  • receiver_col (str) – Column name for the receiver cell.

  • score_col (str) – Column name for the communication score.

  • score_threshold (float, optional (default: None)) – Threshold for the communication score. If provided, only scores above this threshold are plotted.

  • panel_size (tuple, optional (default: (12, 8))) – Size of the plot panel. Only works if ax is None.

  • network_layout (str, optional (default: 'spring')) – Layout of the network graph. Should be either ‘spring’ or ‘circular’.

  • edge_color (str, optional (default: 'magenta')) – Color of the edges.

  • edge_width (float, optional (default: 25)) – Width of the edges.

  • edge_arrow_size (float, optional (default: 20)) – Size of the edge arrows.

  • edge_alpha (float, optional (default: 0.25)) – Transparency of the edges.

  • node_color (str, optional (default: '#210070')) – Color of the nodes.

  • node_size (int, optional (default: 1000)) – Size of the nodes.

  • node_alpha (float, optional (default: 0.9)) – Transparency of the nodes.

  • node_label_size (int, optional (default: 12)) – Font size of the node labels.

  • node_label_alpha (float, optional (default: 0.7)) – Transparency of the node labels.

  • node_label_offset (tuple, optional (default: (0.05, -0.2))) – Offset of the node labels.

  • title (str, optional (default: None)) – Title of the plot.

  • title_fontsize (int, optional (default: 14)) – Font size of the title.

  • ax (matplotlib.axes.Axes, optional (default: None)) – Axes object where the plot will be drawn. If None, a new figure is created.

  • save (str, optional (default: None)) – Filepath to save the plot. If None, the plot is not saved.

  • dpi (int, optional (default: 300)) – Resolution of the saved plot.

  • tight_layout (bool, optional (default: True)) – Whether to use tight layout for the plot.

  • bbox_inches (str, optional (default: 'tight')) – Bounding box in inches. Only used if save is provided.

Returns:

  • fig (matplotlib.figure.Figure) – The matplotlib figure object.

  • ax (matplotlib.axes.Axes) – The matplotlib axes object.
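
The expected shape of ccc_scores and the effect of score_threshold can be illustrated without any plotting. In this sketch rows are dicts standing in for DataFrame rows, build_edges is a hypothetical helper, and the strict "greater than" comparison is an assumption based on the wording "above this threshold".

```python
def build_edges(rows, sender_col, receiver_col, score_col, score_threshold=None):
    """Return (sender, receiver, score) tuples, dropping scores at or below the threshold."""
    edges = []
    for row in rows:
        score = row[score_col]
        if score_threshold is None or score > score_threshold:
            edges.append((row[sender_col], row[receiver_col], score))
    return edges


# Made-up scores: one row per directed sender-receiver pair.
ccc_rows = [
    {"sender": "T cell", "receiver": "B cell", "score": 0.8},
    {"sender": "B cell", "receiver": "T cell", "score": 0.1},
    {"sender": "T cell", "receiver": "T cell", "score": 0.5},
]
edges = build_edges(ccc_rows, "sender", "receiver", "score", score_threshold=0.3)
nodes = sorted({cell for s, r, _ in edges for cell in (s, r)})
```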

sccellfie.plotting.create_volcano_plot(de_results, effect_threshold=0.75, padj_threshold=0.05, cell_type=None, group1=None, group2=None, effect_col='cohens_d', effect_title="Cohen's d", wrapped_title_length=50, save=None, dpi=300, tight_layout=True)[source]

Creates a volcano plot for differential analysis results.

Parameters:
  • de_results (pd.DataFrame) – A DataFrame containing the results of the differential analysis. Required columns: ‘feature’, ‘adj_p_value’, and the column specified in effect_col. Optional columns: ‘cell_type’, ‘group1’, ‘group2’.

  • effect_threshold (float, optional (default: 0.75)) – The threshold for the effect size (e.g., log2 fold change or Cohen’s d) to consider a variable significant.

  • padj_threshold (float, optional (default: 0.05)) – The threshold for the adjusted p-value to consider a variable significant.

  • cell_type (str, optional (default: None)) – The specific cell type to plot. If None and cell_type column exists, all cell types are plotted.

  • group1 (str, optional (default: None)) – The first group in the comparison. If None, all group1 values are included.

  • group2 (str, optional (default: None)) – The second group in the comparison. If None, all group2 values are included.

  • effect_col (str, optional (default: 'cohens_d')) – The column in de_results that contains the effect size values.

  • effect_title (str, optional (default: "Cohen's d")) – The title to use for the effect size in the plot.

  • wrapped_title_length (int, optional (default: 50)) – The maximum number of characters per line in the title.

  • save (str, optional (default: None)) – The file path to save the plot. If None, the plot is not saved. A file extension (e.g., ‘.png’) can be provided to specify the file format.

  • dpi (int, optional (default: 300)) – The resolution of the saved figure.

  • tight_layout (bool, optional (default: True)) – Whether to use tight layout for the plot.

Returns:

A list of feature names that are considered significant based on the provided thresholds, sorted by effect size in ascending order. Returns an empty list if no significant features are found.

Return type:

list

Notes

This function creates a volcano plot where:

  • The x-axis represents the effect size (e.g., log2 fold change or Cohen’s d).

  • The y-axis represents the -log10(adjusted p-value).

  • Gray points indicate non-significant features.

  • Red points indicate significant features that pass both thresholds.

  • Dashed lines indicate the significance thresholds.
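
The significance rule behind the returned list can be sketched in plain Python. The details here are assumptions rather than the library's code: the effect size is compared by absolute value, the effect comparison is inclusive while the p-value comparison is strict, and rows are dicts standing in for DataFrame rows. significant_features and volcano_y are hypothetical names.

```python
import math


def significant_features(rows, effect_threshold=0.75, padj_threshold=0.05,
                         effect_col="cohens_d"):
    """Return significant feature names sorted by effect size (ascending)."""
    hits = [
        r for r in rows
        if abs(r[effect_col]) >= effect_threshold and r["adj_p_value"] < padj_threshold
    ]
    return [r["feature"] for r in sorted(hits, key=lambda r: r[effect_col])]


def volcano_y(adj_p_value):
    """y-axis coordinate of a point: -log10 of the adjusted p-value."""
    return -math.log10(adj_p_value)


# Made-up differential analysis rows.
de_rows = [
    {"feature": "task_A", "cohens_d": 1.2, "adj_p_value": 0.001},
    {"feature": "task_B", "cohens_d": -0.9, "adj_p_value": 0.01},
    {"feature": "task_C", "cohens_d": 0.2, "adj_p_value": 0.0001},
    {"feature": "task_D", "cohens_d": 1.5, "adj_p_value": 0.2},
]
sig = significant_features(de_rows)
```

Here task_C fails the effect-size threshold and task_D fails the adjusted p-value threshold, so only task_B and task_A remain, with the negative effect listed first.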

sccellfie.plotting.create_comparative_violin(adata, significant_features, group1, group2, condition_key, celltype, cell_type_key, xlabel='Feature', ylabel='Metabolic Activity', title=None, wrapped_title_length=50, figsize=(16, 7), fontsize=10, violin_cut=0, palette=['coral', 'lightsteelblue'], lgd_bbox_to_anchor=(1.05, 1), lgd_loc='upper left', save=None, dpi=300, tight_layout=True)[source]

Compares features between two groups for a specific cell type in an AnnData object and creates a violin plot.

Parameters:
  • adata (AnnData) – An AnnData object containing the data.

  • significant_features (list) – List of significant feature names from the volcano plot function, sorted by effect size.

  • group1 (str) – The name of the first group to compare.

  • group2 (str) – The name of the second group to compare.

  • condition_key (str) – The column name in adata.obs containing the condition information.

  • celltype (str) – The cell type to analyze.

  • cell_type_key (str) – The column name in adata.obs containing the cell type information.

  • xlabel (str, optional (default: 'Feature')) – The label for the x-axis.

  • ylabel (str, optional (default: 'Metabolic Activity')) – The label for the y-axis.

  • title (str, optional (default: None)) – The title for the plot. If None, a default title is used.

  • wrapped_title_length (int, optional (default: 50)) – The maximum number of characters per line in the title.

  • figsize (tuple, optional (default: (16, 7))) – The figure size.

  • fontsize (int, optional (default: 10)) – The font size for the labels and legend.

  • violin_cut (float, optional (default: 0)) – The cut parameter for the violin plot. Distance, in units of bandwidth, to extend the density past extreme datapoints. Set to 0 to limit the violin within the data range.

  • palette (list, optional (default: ['coral', 'lightsteelblue'])) – The color palette for the plot. Each color corresponds to a group or condition.

  • lgd_bbox_to_anchor (tuple, optional (default: (1.05, 1))) – The bbox_to_anchor parameter for the legend.

  • lgd_loc (str, optional (default: 'upper left')) – The location of the legend.

  • save (str, optional (default: None)) – The file path to save the plot. If None, the plot is not saved.

  • dpi (int, optional (default: 300)) – The resolution of the saved figure.

  • tight_layout (bool, optional (default: True)) – Whether to use tight layout for the plot.

Returns:

fig, ax – The matplotlib Figure and Axes objects for the plot.

Return type:

matplotlib.pyplot.Figure, matplotlib.pyplot.Axes

sccellfie.plotting.create_beeswarm_plot(df, x='log2FC', y='cell_type', cohen_threshold=0.5, pval_threshold=0.05, show_n_significant=True, logfc_threshold=1.0, title=None, title_fontsize=20, ticks_fontsize=14, labels_fontsize=16, condition1_color='#8B0000', condition2_color='#000080', ns_color='#808080', strip_size=4, strip_alpha=0.6, strip_jitter=0.2, lgd_fontsize=14, lgd_marker_size=12, lgd_frameon=False, lgd_loc='upper left', lgd_bbox_to_anchor=(1.1, 1), sort_lambda=None, figsize=(10, 12), save=None, dpi=300, tight_layout=True)[source]

Creates a beeswarm plot to visualize differential analysis results. X-axis represents the effect size (e.g., log2 fold change or Cohen’s d). Y-axis represents the cell types or any categorical variable.

Parameters:
  • df (DataFrame) – A DataFrame containing the results of the differential analysis, as produced by the function pairwise_differential_analysis. The DataFrame should have at least the following columns: ‘cell_type’, ‘feature’, ‘group1’, ‘group2’, ‘log2FC’, ‘cohens_d’, ‘adj_p_value’.

  • x (str, optional (default: 'log2FC')) – The column in df to use as the x-axis.

  • y (str, optional (default: 'cell_type')) – The column in df to use as the y-axis.

  • cohen_threshold (float, optional (default: 0.5)) – The threshold for Cohen’s d to consider a feature significant.

  • pval_threshold (float, optional (default: 0.05)) – The threshold for the adjusted p-value to consider a feature significant.

  • show_n_significant (bool, optional (default: True)) – Whether to show the count of significant features per cell type.

  • logfc_threshold (float, optional (default: 1.0)) – The threshold for the log2 fold change to consider a feature significant.

  • title (str, optional (default: None)) – The title for the plot. If None, a default title is used.

  • title_fontsize (int, optional (default: 20)) – The font size for the title.

  • ticks_fontsize (int, optional (default: 14)) – The font size for the ticks.

  • labels_fontsize (int, optional (default: 16)) – The font size for the labels.

  • condition1_color (str, optional (default: '#8B0000')) – The color for the first condition.

  • condition2_color (str, optional (default: '#000080')) – The color for the second condition.

  • ns_color (str, optional (default: '#808080')) – The color for non-significant features.

  • strip_size (int, optional (default: 4)) – The size of the strip plot points.

  • strip_alpha (float, optional (default: 0.6)) – The transparency of the strip plot points.

  • strip_jitter (float, optional (default: 0.2)) – The amount of jitter to apply to the strip plot points.

  • lgd_fontsize (int, optional (default: 14)) – The font size for the legend.

  • lgd_marker_size (int, optional (default: 12)) – The size of the legend markers.

  • lgd_frameon (bool, optional (default: False)) – Whether to show the legend frame.

  • lgd_loc (str, optional (default: 'upper left')) – The location of the legend.

  • lgd_bbox_to_anchor (tuple, optional (default: (1.1, 1))) – The bbox_to_anchor parameter for the legend.

  • sort_lambda (function, optional (default: None)) – A lambda function to sort the y-axis values. If None, the values are sorted by the y-axis column.

  • figsize (tuple, optional (default: (10, 12))) – The figure size.

  • save (str, optional (default: None)) – The file path to save the plot. If None, the plot is not saved.

  • dpi (int, optional (default: 300)) – The resolution of the saved figure.

  • tight_layout (bool, optional (default: True)) – Whether to use tight layout for the plot.

Returns:

  • fig, ax (matplotlib.pyplot.Figure, matplotlib.pyplot.Axes) – The matplotlib Figure and Axes objects for the plot.

  • sig_df (DataFrame) – The input DataFrame filtered to only include significant features. Index is set to ‘cell_type’ and ‘feature’.

sccellfie.plotting.create_multi_violin_plots(adata, features, groupby, n_cols=4, figsize=(5, 5), ylabel=None, title=None, fontsize=10, rotation=90, wrapped_title_length=45, save=None, dpi=300, tight_layout=True, w_pad=None, h_pad=None, **kwargs)[source]

Plots a grid of violin plots for multiple genes in Scanpy, controlling the number of columns.

Parameters:
  • adata (AnnData) – Annotated data matrix.

  • features (list of str) – List of feature names to plot. Should match names in adata.var_names.

  • groupby (str) – Key in adata.obs containing the groups to plot. For each unique value in this column, a violin plot will be generated.

  • n_cols (int, optional (default: 4)) – Number of columns in the grid.

  • figsize (tuple of float, optional (default: (5, 5))) – Size of each subplot in inches.

  • ylabel (str, optional (default: None)) – Label for the y-axis. If None, the label will be the variable name.

  • title (list of str, optional (default: None)) – List of labels for each feature. If None, the feature name will be used.

  • fontsize (int, optional (default: 10)) – Font size for the title and axis labels. The tick labels will be set to fontsize, while the title will be set to fontsize + 4. Ylabel will be set to fontsize + 2.

  • rotation (int, optional (default: 90)) – Rotation of the x-axis tick labels

  • wrapped_title_length (int, optional (default: 45)) – The maximum number of characters per line in the title.

  • save (str, optional (default: None)) – Filepath to save the figure. If not provided, the figure will be displayed.

  • dpi (int, optional (default: 300)) – Resolution of the saved figure.

  • tight_layout (bool, optional (default: True)) – Whether to use tight layout.

  • w_pad (float, optional (default: None)) – Width padding between subplots.

  • h_pad (float, optional (default: None)) – Height padding between subplots.

  • **kwargs (dict) – Additional arguments to pass to sc.pl.violin. For example, rotation can be used to rotate the x-axis labels.
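
Since figsize is described as the size of each subplot, the overall grid can be estimated with a little arithmetic. The sketch below assumes the total figure size is the per-panel size multiplied by the number of columns and rows; that scaling is an assumption, and grid_shape is a hypothetical helper.

```python
import math


def grid_shape(n_features, n_cols=4, figsize=(5, 5)):
    """Return (n_rows, total_figsize) for a grid with one panel per feature."""
    if n_features < 1:
        raise ValueError("at least one feature is required")
    n_rows = math.ceil(n_features / n_cols)
    # Assumption: figsize is per panel, so the full figure scales with the grid.
    total_figsize = (figsize[0] * n_cols, figsize[1] * n_rows)
    return n_rows, total_figsize


n_rows, total_figsize = grid_shape(10, n_cols=4)
```

Ten features with the default four columns need three rows, the last of which is only partly filled.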

sccellfie.plotting.create_radial_plot(metabolic_df, task_info_df, cell_type=None, tissue=None, task_col='metabolic_task', category_col='System', value_col='scaled_trimean', tissue_col='tissue', cell_type_col='cell_type', figsize=(6, 6), title='Metabolic activities', palette='Dark2', title_fontsize=24, legend_fontsize=14, legend_loc='center left', legend_bbox_to_anchor=(1.1, 0.5), alpha_fill=0.25, alpha_bg=0.1, ylim=1.0, sort_by_value=False, ax=None, show_legend=True, save=None, dpi=300, bbox_inches='tight', tight_layout=True)[source]

Creates a radial plot of metabolic task activities grouped by category.

Parameters:
  • metabolic_df (pandas.DataFrame) – DataFrame containing metabolic task activities. Typically, it corresponds to the ‘melted’ dataframe in the outputs from sccellfie.reports.summary.generate_report_from_adata(). Required columns: task_col, value_col, cell_type_col, tissue_col.

  • task_info_df (pandas.DataFrame) – DataFrame containing task categorization information. Required columns: task_col and category_col.

  • cell_type (str, optional (default: None)) – The specific cell type to plot. If None, the maximum activity across all cell types within the specified tissue is used.

  • tissue (str, optional (default: None)) – The specific tissue to plot. If None, all tissues are included.

  • task_col (str, optional (default: 'metabolic_task')) – The column name in metabolic_df containing task identifiers.

  • category_col (str, optional (default: 'System')) – The column name in task_info_df containing category information.

  • value_col (str, optional (default: 'scaled_trimean')) – The column name in metabolic_df containing activity values.

  • tissue_col (str, optional (default: 'tissue')) – The column name in metabolic_df containing tissue information.

  • cell_type_col (str, optional (default: 'cell_type')) – The column name in metabolic_df containing cell type information.

  • figsize (tuple, optional (default: (6, 6))) – The size of the figure. Only used if ax is None.

  • title (str, optional (default: 'Metabolic activities')) – The title for the plot. Set to None to disable the title.

  • palette (str, optional (default: 'Dark2')) – Name of a palette for coloring the categories of metabolic tasks.

  • title_fontsize (int, optional (default: 24)) – Font size for the title.

  • legend_fontsize (int, optional (default: 14)) – Font size for the legend.

  • legend_loc (str, optional (default: "center left")) – Location of the legend.

  • legend_bbox_to_anchor (tuple, optional (default: (1.1, 0.5))) – Position of the legend relative to the legend_loc.

  • alpha_fill (float, optional (default: 0.25)) – Alpha transparency for the filled areas.

  • alpha_bg (float, optional (default: 0.1)) – Alpha transparency for the background areas.

  • ylim (float, optional (default: 1.0)) – Limit value for the y-axis (radial direction). If None, the maximum value across all tasks is used instead.

  • sort_by_value (bool, optional (default: False)) – If True, tasks within each category are sorted by their value. If False, tasks are sorted alphabetically within each category.

  • ax (matplotlib.axes.Axes, optional (default: None)) – A matplotlib axes with polar projection to draw the plot on. If None, a new figure and axes are created.

  • show_legend (bool, optional (default: True)) – Whether to display the legend.

  • save (str, optional (default: None)) – The filepath to save the figure. If None, the figure is not saved.

  • dpi (int, optional (default: 300)) – The resolution of the saved figure.

  • bbox_inches (str, optional (default: 'tight')) – The bbox_inches parameter for saving the figure.

  • tight_layout (bool, optional (default: True)) – Whether to use tight layout for the plot. Only applied if ax is None.

Returns:

  • fig (matplotlib.figure.Figure) – The matplotlib figure object.

  • ax (matplotlib.axes.Axes) – The matplotlib axes object.
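
The selection rule for cell_type and tissue (a specific cell type, or the maximum across cell types when cell_type is None) can be sketched on long-format rows. This is an illustration with dicts in place of DataFrame rows; task_values is a hypothetical helper and the task names and values are made up.

```python
def task_values(rows, cell_type=None, tissue=None, task_col="metabolic_task",
                value_col="scaled_trimean", tissue_col="tissue",
                cell_type_col="cell_type"):
    """Collapse long-format rows to one value per task (illustrative only)."""
    values = {}
    for row in rows:
        if tissue is not None and row[tissue_col] != tissue:
            continue
        if cell_type is not None and row[cell_type_col] != cell_type:
            continue
        task = row[task_col]
        # With no cell type selected, several rows share a task: keep the maximum.
        values[task] = max(values.get(task, float("-inf")), row[value_col])
    return values


melted = [
    {"tissue": "Blood", "cell_type": "T cell", "metabolic_task": "Task 1", "scaled_trimean": 0.4},
    {"tissue": "Blood", "cell_type": "B cell", "metabolic_task": "Task 1", "scaled_trimean": 0.7},
    {"tissue": "Blood", "cell_type": "T cell", "metabolic_task": "Task 2", "scaled_trimean": 0.2},
    {"tissue": "Liver", "cell_type": "Hepatocyte", "metabolic_task": "Task 1", "scaled_trimean": 0.9},
]
blood_max = task_values(melted, tissue="Blood")
blood_t_cell = task_values(melted, cell_type="T cell", tissue="Blood")
```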

Examples

>>> import matplotlib.pyplot as plt
>>> import pandas as pd
>>> from sccellfie.plotting import create_radial_plot
>>>
>>> # Load example data
>>> metabolic_df = pd.read_csv('Melted.csv')
>>> task_info_df = pd.read_csv('TaskInfo.csv')
>>>
>>> # Create radial plot for maximum activities across all cell types in a tissue
>>> fig, ax = create_radial_plot(metabolic_df, task_info_df, tissue='Blood')
>>> plt.show()
>>>
>>> # Create radial plot for a specific cell type in a specific tissue
>>> fig, ax = create_radial_plot(metabolic_df, task_info_df, cell_type='T cell', tissue='Blood')
>>> plt.show()
>>>
>>> # Create multiple subplots with shared legend
>>> fig = plt.figure(figsize=(20, 10))
>>> ax1 = fig.add_subplot(121, projection='polar')
>>> ax2 = fig.add_subplot(122, projection='polar')
>>>
>>> # First subplot with legend
>>> create_radial_plot(metabolic_df, task_info_df, tissue='Blood', ax=ax1, show_legend=True)
>>> # Second subplot without legend
>>> create_radial_plot(metabolic_df, task_info_df, tissue='Liver', ax=ax2, show_legend=False)
>>> plt.tight_layout()
>>> plt.show()

sccellfie.plotting.plot_neighbor_distribution(results, figsize=(15, 8), save=None, dpi=300, bbox_inches='tight', tight_layout=True)[source]

Visualizes the neighbor distribution analysis results.

Parameters:
  • results (dict) – Output from the sccellfie.spatial.neighborhood.compute_neighbor_distribution function.

  • figsize (tuple, optional (default: (15, 8))) – Figure size for the combined plots.

  • save (str, optional (default: None)) – Filepath to save the figure.

  • dpi (int, optional (default: 300)) – Resolution of the saved figure.

  • bbox_inches (str, optional (default: 'tight')) – Bounding box in inches. Only used if save is provided.

  • tight_layout (bool, optional (default: True)) – Whether to use tight layout.

Returns:

  • fig (matplotlib.figure.Figure) – The matplotlib figure object.

  • gs (matplotlib.gridspec.GridSpec) – The matplotlib gridspec object.

sccellfie.plotting.plot_spatial(adata, keys, suptitle=None, suptitle_fontsize=20, title_fontsize=14, legend_fontsize=12, bkgd_label='H&E', wrapped_title_length=45, ncols=3, hspace=0.15, wspace=0.1, save=None, dpi=300, bbox_inches='tight', tight_layout=True, **kwargs)[source]

Plots spatial expression of multiple genes in Scanpy.

Parameters:
  • adata (AnnData) – AnnData object containing gene expression and spatial information.

  • keys (list of str) – List of feature names to plot. Should match names in adata.var_names or a column in adata.obs.

  • suptitle (str, optional (default: None)) – Title for the entire figure.

  • suptitle_fontsize (int, optional (default: 20)) – Font size for the figure title.

  • title_fontsize (int, optional (default: 14)) – Font size for each subplot title (key name).

  • legend_fontsize (int, optional (default: 12)) – Font size for the legend elements.

  • hspace (float, optional (default: 0.15)) – Height space between subplots.

  • wspace (float, optional (default: 0.1)) – Width space between subplots.

  • bkgd_label (str, optional (default: 'H&E')) – Label for the background image.

  • wrapped_title_length (int, optional (default: 45)) – The maximum number of characters per line in the title.

  • ncols (int, optional (default: 3)) – Number of columns in the grid.

  • save (str, optional (default: None)) – Filepath to save the figure.

  • dpi (int, optional (default: 300)) – Resolution of the saved figure. Only used if save is provided.

  • bbox_inches (str, optional (default: 'tight')) – Bounding box in inches. Only used if save is provided.

  • tight_layout (bool, optional (default: True)) – Whether to use tight layout.

  • **kwargs (dict) – Additional arguments to pass to scanpy.pl.spatial.

Returns:

  • fig (matplotlib.figure.Figure) – The matplotlib figure object.

  • axes (numpy.ndarray) – Array of matplotlib axes.

sccellfie.plotting.plot_segmentation(adata, spatial_key: str = 'X_spatial', color_by: str | Sequence[str] | None = None, celltype_key: str = 'cell_type', segmentation: dict | None = None, cell_id_col: str | None = None, palette: dict | None = None, highlight: List[str] | None = None, layer: str | None = None, crop: Tuple[float, float, float, float] | None = None, invert_yaxis: bool = True, legend: bool = True, legend_loc: str = 'center left', legend_bbox: Tuple[float, float] | None = (1.01, 0.5), legend_frameon: bool = False, legend_title: str | None = None, legend_fontsize: float | None = 7.0, legend_ncol: int = 1, legend_params: dict | None = None, axes_off: bool = True, figsize: Tuple[float, float] | None = None, ax=None, ncols: int = 4, panel_titles: bool = True, title: str | Sequence[str] | None = None, title_fontsize: float | None = 12, wrapped_title_length: int = 45, dpi: int = 150, scatter_size: float = 2.0, cmap: str = 'viridis', vmin: float | None = None, vmax: float | None = None, y_pad_ratio: float = 0.1, x_pad_ratio: float = 0.0, scalebar: bool = True, scalebar_kwargs: dict | None = None, cbar_kwargs: dict | None = None, save: str | None = None)[source]

Plot cell-resolution spatial data from an AnnData object.

Renders cells as segmentation polygons when segmentation is provided, otherwise as a centroid scatter plot. Supports categorical and continuous colouring, optional highlighting of a subset of categories, axis cropping, and a scalebar with bottom/top padding.

When color_by is a list, multiple panels are drawn in a grid laid out by ncols (matching sc.pl.spatial semantics): the geometry, crop, and view limits are computed once and shared across panels; each panel is coloured independently and gets its own legend or colorbar.

Parameters:
  • adata (anndata.AnnData) – AnnData with spatial coordinates in adata.obsm[spatial_key].

  • spatial_key (str, optional (default: "X_spatial")) – Key in adata.obsm for the (n_cells, 2+) coordinate array. Defaults to scCellFie’s canonical key.

  • color_by (str, list of str, or None, optional (default: None)) – Column in adata.obs or name in adata.var_names to colour by. If None, falls back to celltype_key. Pass a list of names (e.g. ["task_A", "task_B", "GENE1"]) to render a multi-panel figure with one panel per feature.

  • celltype_key (str, optional (default: "cell_type")) – Default categorical column used when color_by is None.

  • segmentation (dict, optional (default: None)) – Mapping cell_id -> shapely.Polygon (e.g. output of sccellfie.io.load_segmentation() with output="dict"). If None, a scatter of centroids is drawn.

  • cell_id_col (str, optional (default: None)) – Column in adata.obs identifying cells. Defaults to adata.obs.index.

  • palette (dict, optional (default: None)) – Custom {category: color} mapping for categorical colouring. Falls back to adata.uns["{color_by}_colors"] or matplotlib Set2 cycling. In multi-panel mode the same palette is reused for every categorical feature.

  • highlight (list of str, optional (default: None)) – Subset of categories to highlight; all others are drawn in whitesmoke and excluded from the legend.

  • layer (str, optional (default: None)) – Layer name in adata.layers used when color_by is a gene. If None, uses adata.X.

  • crop (tuple, optional (default: None)) – (minx, miny, maxx, maxy) bounds to restrict the view. Data outside this box is not rendered. If None, uses data extent.

  • invert_yaxis (bool, optional (default: True)) – Invert the y-axis (microscopy convention).

  • legend (bool, optional (default: True)) – Show the legend for categorical data, or a colorbar for continuous data.

  • legend_loc (str, optional (default: "center left")) – loc argument passed to ax.legend(). Ignored for colorbar.

  • legend_bbox (tuple, optional (default: (1.01, 0.5))) – bbox_to_anchor for the legend. Use None to disable the anchor and rely on legend_loc alone.

  • legend_frameon (bool, optional (default: False)) – Whether the legend frame/border is drawn.

  • legend_title (str, optional (default: None)) – Title shown above the legend entries.

  • legend_fontsize (float, optional (default: 7.0)) – Font size for legend labels. None falls back to the matplotlib default. The small default suits spatial plots with many categories; bump it via legend_params={'fontsize': 10} (or the dedicated arg) when needed.

  • legend_ncol (int, optional (default: 1)) – Number of columns in the legend.

  • legend_params (dict, optional (default: None)) – Arbitrary kwargs forwarded to ax.legend(...) (e.g. handlelength, labelspacing, borderpad, columnspacing). Keys here override the dedicated legend_* arguments on conflict.

  • axes_off (bool, optional (default: True)) – Remove ticks, tick labels, and spines (standard for spatial plots).

  • figsize (tuple, optional (default: None)) –

    • Single panel (color_by is a str or None): the figure size, defaulting to (10, 10) when None.

    • Multi panel (color_by is a list): the per-panel size, defaulting to (4, 4) when None. The total figure size is (figsize[0] * ncols, figsize[1] * nrows).

    Ignored when ax is provided.

  • ax (matplotlib.axes.Axes, optional (default: None)) – Existing axes to draw onto. Only valid when color_by is a single feature (or None). For multi-panel, omit ax and let the function build the grid.

  • ncols (int, optional (default: 4)) – Number of columns in the panel grid when color_by is a list. Number of rows is ceil(len(color_by) / ncols). Mirrors sc.pl.spatial’s ncols parameter.

  • panel_titles (bool, optional (default: True)) – Master toggle for panel titles. When True, each panel’s title is set to the corresponding feature name (or to the explicit string passed via title=). Set False to suppress titles entirely (in single- and multi-panel modes).

  • title (str, list of str, or None, optional (default: None)) – Explicit title override. For single-feature mode pass a string; for multi-feature mode pass a list of strings whose length matches color_by. When None (default), titles are auto-derived from the feature names. Ignored if panel_titles=False.

  • title_fontsize (float, optional (default: 12)) – Font size of the per-panel title. Mirrors the convention in sccellfie.plotting.plot_spatial().

  • wrapped_title_length (int, optional (default: 45)) – Maximum number of characters per title line. Long feature names (e.g. metabolic-task labels) are wrapped via textwrap.wrap() before being set, matching the behavior of the other tool plots (plot_spatial(), create_multi_violin_plots(), create_volcano_plot()). Pass a large value (e.g. 1000) to disable wrapping.

  • dpi (int, optional (default: 150)) – Figure DPI; also used when save is set.

  • scatter_size (float, optional (default: 2.0)) – Marker size for centroid scatter mode.

  • cmap (str, optional (default: "viridis")) – Matplotlib colormap name for continuous colouring.

  • vmin (float, optional (default: None)) – Lower bound for continuous colouring. When set, values below vmin are clipped at the colormap edge and the colorbar is restricted to [vmin, vmax]. Ignored for categorical colouring. In multi-panel mode the same bound applies to every panel (useful for comparing features on a shared scale). Pass only one of vmin and vmax to cap a single side.

  • vmax (float, optional (default: None)) – Upper bound for continuous colouring. When set, values above vmax are clipped at the colormap edge and the colorbar is restricted to [vmin, vmax]. Ignored for categorical colouring. In multi-panel mode the same bound applies to every panel (useful for comparing features on a shared scale). Pass only one of vmin and vmax to cap a single side.

  • y_pad_ratio (float, optional (default: 0.1)) – Fraction of the y range added as top/bottom whitespace (so the scalebar label has room).

  • x_pad_ratio (float, optional (default: 0.0)) – Fraction of the x range added as left/right whitespace. Default keeps x tight to data — increase when the legend or a colorbar sits to the right of the plot and you want extra breathing room on the data side too.

  • scalebar (bool, optional (default: True)) – Draw a scalebar on every panel.

  • scalebar_kwargs (dict, optional (default: None)) – Overrides for the scalebar (e.g. length, units, color, position, pad_frac, fontsize, text_pad_pts). pad_frac is the inset of the bar from the axes corner as a fraction of the axes height/width; text_pad_pts is the gap (in points) between the bar and its label. The label is always placed on the side of the bar away from the data, so it never overlaps cells when y_pad_ratio > 0.

  • cbar_kwargs (dict, optional (default: None)) – Overrides passed to plt.colorbar for continuous colouring.

  • save (str, optional (default: None)) – If given, save the figure to this path with dpi and bbox_inches="tight".

Returns:

  • fig (matplotlib.figure.Figure) – The matplotlib figure object.

  • ax (matplotlib.axes.Axes or numpy.ndarray of Axes) – Single Axes when color_by is a string (or None); a 2D array of Axes (shape (nrows, ncols)) when color_by is a list.
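The multi-panel layout rules described above (rows as ceil(len(color_by) / ncols), total figure size as the per-panel size times the grid shape, and titles wrapped with textwrap.wrap()) can be sketched in plain Python. This is only an illustration of the documented arithmetic; the helper names panel_grid and wrap_title are hypothetical and are not part of scCellFie.

```python
import math
import textwrap


def panel_grid(n_features, ncols=4, figsize=(4, 4)):
    """Return (nrows, total_figsize) for a grid of n_features panels.

    Hypothetical helper: mirrors the documented rule
    nrows = ceil(len(color_by) / ncols) and
    total size = (figsize[0] * ncols, figsize[1] * nrows).
    """
    nrows = math.ceil(n_features / ncols)
    total_figsize = (figsize[0] * ncols, figsize[1] * nrows)
    return nrows, total_figsize


def wrap_title(title, wrapped_title_length=45):
    """Wrap a long title into lines of at most wrapped_title_length characters."""
    return "\n".join(textwrap.wrap(title, width=wrapped_title_length))


color_by = ["task_A", "task_B", "GENE1", "GENE2", "GENE3"]
nrows, total_figsize = panel_grid(len(color_by), ncols=4, figsize=(4, 4))

# An invented long label, wrapped with a deliberately small width.
wrapped = wrap_title("An illustrative and rather long metabolic task label", 20)
```

With five features and ncols=4, the grid needs two rows, so the default per-panel size of (4, 4) gives a (16, 8) figure.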

Preprocessing

sccellfie.preprocessing.get_adata_gene_expression(adata, gene, layer=None, use_raw=False)[source]

Gets the expression values for a given feature from an AnnData object. Checks both adata.var_names (gene expression) and adata.obs (metadata).

Parameters:
  • adata (AnnData) – AnnData object containing the expression data.

  • gene (str) – Name of the gene or feature to extract the expression values for.

  • layer (str, optional (default: None)) – Name of the layer to extract the expression values from. This layer has priority over adata.X and use_raw.

  • use_raw (bool, optional (default: False)) – If True, use the raw data in adata.raw.X if available.

Returns:

expression – Array containing the expression values for the specified gene.

Return type:

numpy.ndarray

sccellfie.preprocessing.stratified_subsample_adata(adata, group_column, target_fraction=0.2, random_state=0)[source]

Stratified subsampling of an AnnData object.

Parameters:
  • adata (AnnData) – Annotated data matrix.

  • group_column (str) – Column name in adata.obs containing the group information.

  • target_fraction (float, optional (default: 0.20)) – Fraction of cells to sample from each group.

  • random_state (int, optional (default: 0)) – Random seed for reproducibility.

Returns:

adata_subsampled – Subsampled AnnData object

Return type:

AnnData
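As a rough illustration of what stratified subsampling means here, the sketch below draws the same fraction of cells from every group using only the standard library. It is not scCellFie's implementation: the function name stratified_sample_indices is hypothetical, and keeping at least one cell per group is an assumption made for this example.

```python
import random
from collections import defaultdict


def stratified_sample_indices(groups, target_fraction=0.2, random_state=0):
    """Pick roughly target_fraction of the positions within each group label."""
    rng = random.Random(random_state)  # seeded for reproducibility
    by_group = defaultdict(list)
    for idx, label in enumerate(groups):
        by_group[label].append(idx)

    selected = []
    for label in sorted(by_group):
        members = by_group[label]
        # Assumption for this sketch: never drop a group entirely.
        n_keep = max(1, round(len(members) * target_fraction))
        selected.extend(rng.sample(members, n_keep))
    return sorted(selected)


# Stand-in for adata.obs[group_column]: 50 T cells, 20 B cells, 10 NK cells.
groups = ["T cell"] * 50 + ["B cell"] * 20 + ["NK"] * 10
kept = stratified_sample_indices(groups, target_fraction=0.2, random_state=0)
```

Each group contributes in proportion to its size, so the subsample preserves the original group composition.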

sccellfie.preprocessing.normalize_adata(adata, target_sum=10000, n_counts_key='n_counts', chunk_size=None, copy=False)[source]

Memory-efficient normalization of AnnData object. Works directly on sparse matrices without converting to dense.

Parameters:
  • adata (AnnData) – Annotated data matrix containing the expression data.

  • target_sum (int, optional (default: 10_000)) – The target sum to which the data will be normalized.

  • n_counts_key (str, optional (default: 'n_counts')) – The key in adata.obs containing the total counts for each cell.

  • chunk_size (int or None, optional (default: None)) – If None, process entire matrix at once (faster, more memory). If int, process matrix in chunks of this size (slower, less memory). Recommended for very large datasets (>1M cells).

  • copy (bool, optional (default: False)) – If True, returns a copy of adata with the normalized data.
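The scaling step implied by target_sum and chunk_size can be sketched as follows, using nested lists in place of a sparse matrix. This is a simplified illustration under stated assumptions: the helper normalize_rows is hypothetical, only total-count scaling is shown, and leaving all-zero cells at zero is a choice made for this example.

```python
def normalize_rows(counts, target_sum=10_000, chunk_size=None):
    """Scale each row (cell) so that its values sum to target_sum.

    Rows are processed in blocks of chunk_size to mimic chunked processing;
    chunk_size=None handles all rows in a single block.
    """
    n_rows = len(counts)
    step = max(1, n_rows if chunk_size is None else chunk_size)
    normalized = []
    for start in range(0, n_rows, step):
        for row in counts[start:start + step]:
            total = sum(row)
            # Assumption: cells with zero total counts stay all-zero.
            scale = target_sum / total if total > 0 else 0.0
            normalized.append([value * scale for value in row])
    return normalized


counts = [[1, 3, 0], [0, 0, 0], [2, 2, 4]]
full = normalize_rows(counts, target_sum=10_000)
chunked = normalize_rows(counts, target_sum=10_000, chunk_size=2)
```

Chunking changes only how many rows are held in memory at once, not the result.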

sccellfie.preprocessing.transform_adata_gene_names(adata, filename=None, organism='human', copy=True, drop_unmapped=False)[source]

Transforms gene names in an AnnData object from Ensembl IDs to gene symbols.

Parameters:
  • adata (AnnData) – Annotated data matrix containing the expression data. All gene names must be in Ensembl ID format.

  • filename (str, optional) – The file path to a custom CSV file containing Ensembl IDs and gene symbols. One column must be ‘ensembl_id’ and the other ‘symbol’.

  • organism (str, optional (default: 'human')) – The organism to retrieve data for. Choose ‘human’ or ‘mouse’.

  • copy (bool, optional (default: True)) – If True, return a copy of the AnnData object. If False, modify the object in place.

  • drop_unmapped (bool, optional (default: False)) – If True, drop genes that could not be mapped to symbols.

Returns:

The AnnData object with gene names transformed to gene symbols. If copy=True, this is a new object.

Return type:

AnnData

Raises:

ValueError – If not all genes in the AnnData object are in Ensembl ID format.

sccellfie.preprocessing.transfer_variables(adata_target, adata_source, var_names, source_obs_col=None, target_obs_col=None, keep_sparse=True)[source]

Transfers variables from source AnnData to target AnnData, handling different sizes and maintaining sparse matrix format if needed.

Parameters:
  • adata_target (AnnData) – Target AnnData object to add variables to.

  • adata_source (AnnData) – Source AnnData object to get variables from.

  • var_names (str or list) – Names of variables to transfer from adata_source to adata_target.

  • source_obs_col (str, optional) – Column in source adata.obs to use for matching observations (e.g. column containing barcodes).

  • target_obs_col (str, optional) – Column in target adata.obs to use for matching observations (e.g. column containing barcodes).

  • keep_sparse (bool) – Whether to maintain sparse matrix format if present.

Returns:

Updated target AnnData object with new variables

Return type:

AnnData

sccellfie.preprocessing.add_complexes_to_adata(adata, complexes, agg_method='min', layer=None, copy=False)[source]

Adds multi-gene complex expression as new variables in an AnnData object.

Computes per-cell aggregated expression for each complex and appends the result as new columns in adata.X. Layers are handled by computing the complex aggregation for the source layer and zero-filling others.

Parameters:
  • adata (AnnData) – AnnData object containing expression data with individual gene expression.

  • complexes (dict) – Dictionary mapping complex names (str) to lists of subunit gene names. Example: {‘ITGA4&ITGB1’: [‘ITGA4’, ‘ITGB1’]}

  • agg_method (str, default='min') –

    Aggregation across subunits per cell. Options:
    • ‘min’ : Minimum expression (rate-limiting subunit).

    • ‘mean’ : Arithmetic mean expression.

    • ‘gmean’ : Geometric mean expression.

  • layer (str, optional) – Layer to read subunit expression from. If None, uses adata.X.

  • copy (bool, default=False) – If True, return a modified copy. If False, modify adata in place and return None.

Returns:

If copy=True, returns the modified AnnData. Otherwise modifies adata in place and returns None.

Return type:

AnnData or None

Raises:

ValueError – If agg_method is not one of ‘min’, ‘mean’, ‘gmean’. If any subunit gene is not found in adata.var_names. If a complex name already exists in adata.var_names.
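To make the three agg_method options concrete, the sketch below aggregates the subunit expression of one complex cell by cell. It is an illustration and not the library's code: aggregate_complex is a hypothetical helper, the expression values are invented, and returning 0 for the geometric mean when any subunit is 0 is an assumption.

```python
import math


def aggregate_complex(subunit_values, agg_method="min"):
    """Aggregate the expression of a complex's subunits within one cell."""
    if agg_method == "min":
        return min(subunit_values)
    if agg_method == "mean":
        return sum(subunit_values) / len(subunit_values)
    if agg_method == "gmean":
        # Assumption: a missing (zero) subunit zeroes the whole complex.
        if any(value <= 0 for value in subunit_values):
            return 0.0
        log_mean = sum(math.log(v) for v in subunit_values) / len(subunit_values)
        return math.exp(log_mean)
    raise ValueError("agg_method must be one of 'min', 'mean', 'gmean'")


# Invented expression of two subunits across three cells.
expression = {"ITGA4": [2.0, 0.0, 8.0], "ITGB1": [8.0, 5.0, 2.0]}
per_cell = list(zip(expression["ITGA4"], expression["ITGB1"]))

complex_min = [aggregate_complex(cell, "min") for cell in per_cell]
complex_mean = [aggregate_complex(cell, "mean") for cell in per_cell]
complex_gmean = [aggregate_complex(cell, "gmean") for cell in per_cell]
```

The 'min' option is the most conservative of the three: a complex is only as abundant as its scarcest subunit.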

sccellfie.preprocessing.make_complex_name(subunits, separator='&')[source]

Generates a canonical complex name from a list of subunit names.

Parameters:
  • subunits (list of str) – List of subunit gene/task names.

  • separator (str, default='&') – Character(s) used to join the sorted subunit names.

Returns:

Canonical complex name with sorted subunits joined by separator.

Return type:

str

sccellfie.preprocessing.prepare_var_pairs(adata, var_pairs, complex_sep='&', agg_method='min', layer=None)[source]

Prepares variable pairs for communication scoring by detecting multi-element (complex) entries, adding them to adata, and returning normalized string-only pairs.

Each element in a var_pair can be either a string (single gene/task) or a list/tuple of strings (complex with multiple subunits). When a list is detected, the complex is automatically named by joining the sorted subunit names with complex_sep and added to adata via add_complexes_to_adata. Complexes already present in adata.var_names are skipped.

Parameters:
  • adata (AnnData) – AnnData object containing expression data.

  • var_pairs (list of tuples) –

    List of (ligand, receptor) pairs where each element can be:
    • str: single gene or task name.

    • list/tuple of str: subunits of a complex.

    Example:

    var_pairs = [
        (['TASK1', 'TASK2'], ['GENE1', 'GENE2']),  # both complex
        ('TASK3', 'GENE4'),                          # both single
        ('TASK1', ['GENE5', 'GENE6']),               # mixed
    ]
    

  • complex_sep (str, default='&') – Separator used to join subunit names into the complex name.

  • agg_method (str, default='min') – Aggregation method for complex subunits. See add_complexes_to_adata.

  • layer (str, optional) – Layer to read subunit expression from.

Returns:

normalized_pairs – String-only (ligand, receptor) pairs ready for scoring functions. Complex elements are replaced by their generated names.

Return type:

list of tuples
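The naming and normalization logic described above can be sketched without AnnData. The helper below only covers the string handling (sorting subunits, joining them with complex_sep, and replacing list elements by the generated name); the name normalize_pairs is hypothetical, and the step that adds complexes to adata is deliberately left out.

```python
def normalize_pairs(var_pairs, complex_sep="&"):
    """Return string-only pairs plus a {complex_name: subunits} mapping."""
    complexes = {}

    def as_name(element):
        if isinstance(element, str):
            return element  # single gene or task: keep as-is
        subunits = list(element)
        name = complex_sep.join(sorted(subunits))  # canonical, order-independent
        complexes[name] = subunits
        return name

    normalized = [(as_name(left), as_name(right)) for left, right in var_pairs]
    return normalized, complexes


var_pairs = [
    (["TASK1", "TASK2"], ["GENE1", "GENE2"]),  # both complex
    ("TASK3", "GENE4"),                        # both single
    ("TASK1", ["GENE6", "GENE5"]),             # mixed, subunits unsorted
]
normalized_pairs, complexes = normalize_pairs(var_pairs)
```

Because subunits are sorted before joining, ['GENE6', 'GENE5'] and ['GENE5', 'GENE6'] map to the same complex name.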

sccellfie.preprocessing.get_element_associations(df, element, axis_element=0)[source]

Gets the tasks, reactions, or genes associated with a given element in the DataFrame.

Parameters:
  • df (pandas.DataFrame) – DataFrame containing the associations.

  • element (str) – Element for which to get the associations. This can be a task, reaction, or gene. Name should match exactly the name in indexes or columns of the DataFrame.

  • axis_element (int, optional (default: 0)) – Axis along which the element is located. Can be 0 (rows) or 1 (columns).

Returns:

associations – List of tasks, reactions, or genes associated with the given element.

Return type:

list of str

sccellfie.preprocessing.add_new_task(task_by_rxn, task_by_gene, rxn_by_gene, task_info, rxn_info, task_name, task_system, task_subsystem, rxn_names, gpr_hgncs, gpr_symbols)[source]

Adds a new task and its associated reactions and genes to the database.

Parameters:
  • task_by_rxn (pandas.DataFrame) – DataFrame representing the relationship between tasks and reactions.

  • task_by_gene (pandas.DataFrame) – DataFrame representing the relationship between tasks and genes.

  • rxn_by_gene (pandas.DataFrame) – DataFrame representing the relationship between reactions and genes.

  • task_info (pandas.DataFrame) – DataFrame containing information about tasks, including the task name, system (major group of tasks), and subsystem (specific group of tasks).

  • rxn_info (pandas.DataFrame) – DataFrame containing information about reactions, including the reaction name, and the associated GPR rules in HGNC and symbol format.

  • task_name (str) – Name of the task to add.

  • task_system (str) – System (major group of tasks) to which the task belongs.

  • task_subsystem (str) – Subsystem (specific group of tasks) to which the task belongs.

  • rxn_names (list of str) – List of reaction names associated with the task.

  • gpr_hgncs (list of str) – List of GPR rules in HGNC format associated with the reactions. Order should match the order of the reaction names.

  • gpr_symbols (list of str) – List of GPR rules in symbol format associated with the reactions. Order should match the order of the reaction names.

Returns:

  • task_by_rxn (pandas.DataFrame) – Updated DataFrame representing the relationship between tasks and reactions.

  • task_by_gene (pandas.DataFrame) – Updated DataFrame representing the relationship between tasks and genes.

  • rxn_by_gene (pandas.DataFrame) – Updated DataFrame representing the relationship between reactions and genes.

  • task_info (pandas.DataFrame) – Updated DataFrame containing information about tasks, including the task name, system (major group of tasks), and subsystem (specific group of tasks).

  • rxn_info (pandas.DataFrame) – Updated DataFrame containing information about reactions, including the reaction name, and the associated GPR rules in HGNC and symbol format.

sccellfie.preprocessing.combine_and_sort_dataframes(df1, df2, preference='max')[source]

Combines two DataFrames and sorts the rows and columns alphabetically.

Parameters:
  • df1 (pandas.DataFrame) – First DataFrame to combine.

  • df2 (pandas.DataFrame) – Second DataFrame to combine.

  • preference (str, optional) – Preference for which value to keep when both dataframes have the same cell. Options: ‘max’ (default), ‘min’, ‘df1’, ‘df2’.

Returns:

combined_df – Combined DataFrame with all rows and columns from df1 and df2, sorted alphabetically. Missing values are filled with 0.

Return type:

pandas.DataFrame

sccellfie.preprocessing.handle_duplicate_indexes(df, value_column=None, operation='first')[source]

Handles duplicated indexes in a DataFrame by keeping the min, max, mean, first, or last value associated with them in a specified column.

Parameters:
  • df (pandas.DataFrame) – DataFrame with duplicated indexes.

  • value_column (str, optional (default: None)) – Name of the column containing the values used to decide which entry to keep when handling duplicated indexes. This value is optional only when operation is ‘first’ or ‘last’.

  • operation (str, optional (default: 'first')) – Operation to perform when handling duplicated indexes. Options: ‘min’, ‘max’, ‘mean’, ‘first’, ‘last’.

Returns:

df_result – DataFrame with duplicated indexes handled according to the specified operation

Return type:

pandas.DataFrame

sccellfie.preprocessing.clean_gene_names(gpr_rule)[source]

Removes spaces between parentheses and gene IDs in a GPR rule.

Parameters:

gpr_rule (str) – GPR rule to clean.

Returns:

cleaned_gpr – Cleaned GPR rule, without spaces between parentheses and gene IDs.

Return type:

str
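A minimal sketch of this kind of cleanup with regular expressions is shown below. It is one possible approach and not necessarily how the library does it; strip_paren_spaces is a hypothetical name and the gene IDs in the rule are placeholders.

```python
import re


def strip_paren_spaces(gpr_rule):
    """Remove whitespace right after '(' and right before ')'."""
    rule = re.sub(r"\(\s+", "(", gpr_rule)
    return re.sub(r"\s+\)", ")", rule)


cleaned = strip_paren_spaces("( 2597 and 5230 ) or ( 226 )")
```

Spaces around the and/or operators are left untouched, so the logical structure of the rule is unchanged.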

sccellfie.preprocessing.find_genes_gpr(gpr_rule)[source]

Finds all gene IDs in a GPR rule.

Parameters:

gpr_rule (str) – GPR rule to search for gene IDs.

Returns:

genes – List of gene IDs found in the GPR rule.

Return type:

list of str
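One way to pull gene IDs out of a GPR string is to tokenize on whitespace and parentheses and then drop the logical operators. The sketch below follows that idea; extract_gene_ids is a hypothetical helper, and returning each ID once in order of first appearance is an assumption that may differ from the library's behavior.

```python
import re


def extract_gene_ids(gpr_rule):
    """Return the unique gene IDs of a GPR rule in order of first appearance."""
    tokens = re.findall(r"[^\s()]+", gpr_rule)
    genes = []
    for token in tokens:
        if token.lower() in {"and", "or"}:
            continue  # logical operators are not gene IDs
        if token not in genes:
            genes.append(token)
    return genes


genes = extract_gene_ids("(2597 and 5230) or (2597 and 226)")
```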

sccellfie.preprocessing.replace_gene_ids_in_gpr(gpr_rule, gene_id_mapping)[source]

Replaces gene IDs in a GPR rule with new IDs (different nomenclature).

Parameters:
  • gpr_rule (str) – GPR rule to update.

  • gene_id_mapping (dict) – Dictionary mapping old gene IDs to new gene IDs.

Returns:

updated_gpr_rule – GPR rule with gene IDs replaced by new IDs.

Return type:

str

sccellfie.preprocessing.convert_gpr_nomenclature(gpr_rules, id_mapping)[source]

Converts gene IDs in multiple GPR rules to a different nomenclature.

Parameters:
  • gpr_rules (list of str) – List of GPR rules to update.

  • id_mapping (dict) – Dictionary mapping old gene IDs to new gene IDs.

Returns:

converted_rules – List of GPR rules with gene IDs replaced by new IDs.

Return type:

list of str
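Replacing IDs token by token, not with plain substring replacement, avoids corrupting IDs that share a prefix (for example ID1 and ID12). The sketch below illustrates this for a single rule and for a list of rules. The function names and the IDs in the mapping are made up for the example, and leaving unmapped tokens unchanged is an assumption.

```python
import re


def replace_ids(gpr_rule, gene_id_mapping):
    """Swap whole tokens found in gene_id_mapping; leave all others as-is."""
    def swap(match):
        token = match.group(0)
        return gene_id_mapping.get(token, token)

    # Tokens are maximal runs of characters other than whitespace or parentheses.
    return re.sub(r"[^\s()]+", swap, gpr_rule)


def convert_rules(gpr_rules, id_mapping):
    """Apply replace_ids to every rule in a list."""
    return [replace_ids(rule, id_mapping) for rule in gpr_rules]


id_mapping = {"ID1": "GENE_A", "ID12": "GENE_B"}  # illustrative IDs only
converted = convert_rules(["(ID1 and ID12) or ID1", "ID12 or ID99"], id_mapping)
```

The operators and/or pass through unchanged because they are not keys in the mapping.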

sccellfie.preprocessing.get_matrix_gene_expression(matrix, var_names, gene, normalize=False)[source]

Safely extracts expression values for a gene from any matrix type.

Parameters:
  • matrix (numpy.ndarray) – The matrix containing the expression data. Rows correspond to cells and columns to genes.

  • var_names (list or pandas.Index) – The index or array containing the gene names.

  • gene (str) – The gene name to extract.

  • normalize (bool, optional (default: False)) – If True, apply min-max normalization to the expression values.

Returns:

expression – An array containing the expression values for the specified gene.

Return type:

numpy.ndarray

sccellfie.preprocessing.min_max_normalization(df, axis=0)[source]

Applies min-max normalization along specified axis.

Parameters:
  • df (pandas.DataFrame or array-like) – The input DataFrame to be normalized.

  • axis (int, optional (default: 0)) – The axis along which to normalize. Use 0 to normalize each column or 1 to normalize each row using their cognate min and max values.

Returns:

df_scaled – A DataFrame containing the normalized values. The minimum and maximum are computed along the specified axis, so that after scaling they equal 0 and 1, respectively. NaN values are filled with 0.

Return type:

pandas.DataFrame
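The sketch below shows min-max scaling for axis=0 (each column scaled by its own minimum and maximum), with a plain dictionary of columns standing in for a DataFrame. The helper min_max_scale is hypothetical. Treating a constant column as all zeros reflects the documented rule that NaN values (here from a 0/0 division) are filled with 0.

```python
def min_max_scale(values):
    """Scale values to [0, 1] using (v - min) / (max - min)."""
    low, high = min(values), max(values)
    span = high - low
    if span == 0:
        # A constant column would give 0/0 = NaN; NaN is filled with 0.
        return [0.0 for _ in values]
    return [(value - low) / span for value in values]


columns = {"GENE1": [0, 5, 10], "GENE2": [3, 3, 3]}
scaled = {name: min_max_scale(values) for name, values in columns.items()}
```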

sccellfie.preprocessing.compute_dataframes_correlation(df1, df2, col_name=None, method='spearman')[source]

Computes correlations between one column in df1 and all columns in df2.

Parameters:
  • df1 (pandas.DataFrame) – DataFrame of which one column will be correlated against multiple columns in df2.

  • df2 (pandas.DataFrame) – DataFrame containing multiple columns to correlate against the single column in df1.

  • col_name (str, optional (default: None)) – The name of the column in df1 to correlate against df2. If None, the first column in df1 is used.

  • method (str, optional (default: 'spearman')) – The correlation method to use. Either ‘pearson’ or ‘spearman’.

Returns:

DataFrame with correlation coefficients for each column in df2.

Return type:

pandas.DataFrame

sccellfie.preprocessing.preprocess_inputs(adata, gpr_info, task_by_gene, rxn_by_gene, task_by_rxn, correction_organism='human', gene_fraction_threshold=0.0, reaction_fraction_threshold=0.0, verbose=True)[source]

Preprocesses inputs for metabolic analysis.

Parameters:
  • adata (AnnData) – Annotated data matrix.

  • gpr_info (pandas.DataFrame) – DataFrame containing reaction IDs and their corresponding Gene-Protein-Reaction (GPR) rules.

  • task_by_gene (pandas.DataFrame) – DataFrame representing the relationship between tasks and genes.

  • rxn_by_gene (pandas.DataFrame) – DataFrame representing the relationship between reactions and genes.

  • task_by_rxn (pandas.DataFrame) – DataFrame representing the relationship between tasks and reactions.

  • correction_organism (str, optional (default: 'human')) – Organism of the input data. This is important to correct gene names that are present in scCellFie’s or custom database. Check options in sccellfie.preprocessing.prepare_inputs.CORRECT_GENES.keys()

  • gene_fraction_threshold (float, optional (default: 0.0)) – The minimum fraction of genes in a reaction’s GPR that must be present in adata to keep the reaction. Range is 0 to 1. 1.0 means all genes must be present. Any value > 0 and < 1 keeps reactions with at least that fraction of genes present. 0 means keep reactions with at least one gene present.

  • reaction_fraction_threshold (float, optional (default: 0.0)) – The minimum fraction of reactions in a task that must be present after gene filtering to keep the task. Range is 0 to 1. 1.0 means all reactions must be present. Any value > 0 and < 1 keeps tasks with at least that fraction of reactions present. 0 means keep tasks with at least one reaction present.

  • verbose (bool, optional (default: True)) – If True, prints information about the preprocessing results.

Returns:

  • adata2 (AnnData) – Filtered annotated data matrix.

  • gpr_rules (dict) – Dictionary of GPR rules for the filtered reactions.

  • task_by_gene (pandas.DataFrame) – Filtered DataFrame representing the relationship between tasks and genes.

  • rxn_by_gene (pandas.DataFrame) – Filtered DataFrame representing the relationship between reactions and genes.

  • task_by_rxn (pandas.DataFrame) – Filtered DataFrame representing the relationship between tasks and reactions.
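The semantics of gene_fraction_threshold and reaction_fraction_threshold can be summarized by one rule: 0 keeps anything with at least one member present, and any positive value requires at least that fraction of members to be present. The sketch below encodes this rule for reactions and their GPR genes. The helper keep_by_fraction and the toy data are hypothetical, and treating the threshold as inclusive (>=) is an assumption.

```python
def keep_by_fraction(members, available, fraction_threshold=0.0):
    """Decide whether a reaction (or task) passes the presence filter."""
    if not members:
        return False
    present = sum(1 for member in members if member in available)
    if fraction_threshold == 0:
        return present >= 1  # special case: one present member is enough
    return present / len(members) >= fraction_threshold


# Toy data: genes per reaction GPR, and genes found in adata.var_names.
gpr_genes = {"R1": ["A", "B"], "R2": ["C", "D", "E", "F"], "R3": ["X"]}
genes_in_adata = {"A", "B", "C"}


def kept_reactions(threshold):
    return [
        rxn for rxn, genes in gpr_genes.items()
        if keep_by_fraction(genes, genes_in_adata, threshold)
    ]


kept_default = kept_reactions(0.0)
kept_half = kept_reactions(0.5)
kept_all = kept_reactions(1.0)
```

The same function applies to tasks by passing each task's reactions as members and the reactions that survived gene filtering as available.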

Reports

sccellfie.reports.compute_dataset_completeness(adata, gpr_source, task_by_rxn, ablation_impact=None, reaction_impact=None, metric='fraction_zeroed', threshold=1.0, disable_pbar=True)[source]

Evaluate dataset completeness relative to a metabolic task database, at both essential-gene and all-gene scopes, in a single pass.

“Missing” at the dataset level means the gene symbol does not appear in adata.var_names, i.e. the assay does not cover it. This is a property of the dataset as a whole, not of any particular cell.

Parameters:
  • adata (AnnData) – The user’s expression data; only adata.var_names is consulted.

  • gpr_source (dict) – Either {reaction_id: cobra.core.gene.GPR} (as returned by sccellfie.preprocessing.prepare_inputs.preprocess_inputs) or {reaction_id: str} of raw GPR strings.

  • task_by_rxn (pandas.DataFrame) – Rows are tasks, columns are reactions; non-zero where reaction participates in task.

  • ablation_impact (dict of DataFrames, optional) – Output of sccellfie.stats.compute_gene_ablation_impact at the task level. If None, it is computed internally.

  • reaction_impact (dict of DataFrames, optional) – Reaction-level ablation (each reaction treated as its own task). Computed internally when None.

  • metric (str, optional (default: 'fraction_zeroed')) – Impact metric used when deriving essential-gene sets via essential_genes_from_ablation. One of ‘rel_change’, ‘abs_change’, ‘fraction_zeroed’.

  • threshold (float, optional (default: 1.0)) – Threshold paired with metric for essential-gene derivation.

  • disable_pbar (bool, optional (default: True)) – Forwarded to internal compute_gene_ablation_impact calls when impacts are computed here.

Returns:

Dictionary of three flat DataFrames, with the essential-gene and all-gene scopes as suffixed columns:
  • ‘task_completeness’ : one row per task.

  • ‘reaction_completeness’ : one row per reaction.

  • ‘overall_summary’ : single row with aggregate stats.

Return type:

dict

sccellfie.reports.compute_cell_completeness(adata, gpr_source, task_by_rxn, ablation_impact=None, metric='fraction_zeroed', threshold=1.0, layer=None, write_to_obs=True, obs_key_prefix='completeness_', return_matrix=False, disable_pbar=True)[source]

Per-cell completeness relative to the metabolic-task database, at both essential-gene and all-gene scopes.

“Missing” for a given cell and gene means either (a) the gene is absent from adata.var_names (dataset-absent, constant across cells) or (b) the gene is in adata.var_names but has expression == 0 in that cell. Missing genes contribute their rel_change impact on each task; per-cell per-task completeness is 1 - clip(sum of impacts, 0, 1). The final per-cell score aggregates across tasks via the mean.

Parameters:
  • adata (AnnData) – Expression data. adata.X (or adata.layers[layer] if provided) is used to determine which genes are zero in which cells.

  • gpr_source (dict) – As in compute_dataset_completeness.

  • task_by_rxn (pandas.DataFrame) – Tasks x reactions membership matrix.

  • ablation_impact (dict of DataFrames, optional) – Task-level ablation output. Computed internally if None.

  • metric (str, optional (default: 'fraction_zeroed')) – See compute_dataset_completeness.

  • threshold (float, optional (default: 1.0)) – See compute_dataset_completeness.

  • layer (str, optional) – Layer in adata.layers from which to read expression. Defaults to adata.X.

  • write_to_obs (bool, optional (default: True)) – If True, writes adata.obs[obs_key_prefix + ‘essential’] and adata.obs[obs_key_prefix + ‘all’].

  • obs_key_prefix (str, optional (default: 'completeness_')) – Prefix for the obs columns when write_to_obs=True.

  • return_matrix (bool, optional (default: False)) – If True, also return the dense (cell x task) per-scope completeness matrices.

  • disable_pbar (bool, optional (default: True)) – Forwarded to internal compute_gene_ablation_impact call when impact is computed here.

Returns:

Dictionary with the following keys:
  • ‘per_cell’ : DataFrame (cells x [‘completeness_essential’, ‘completeness_all’]).

  • ‘matrix_essential’ : DataFrame (cells x tasks), or None.

  • ‘matrix_all’ : DataFrame (cells x tasks), or None.

Return type:

dict
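The per-cell formula stated above, 1 - clip(sum of impacts, 0, 1) per task followed by a mean across tasks, can be worked through for a single cell. This is only a sketch of that arithmetic: cell_completeness is a hypothetical helper, the impact values are invented, and the distinction between essential-gene and all-gene scopes is not modeled.

```python
def cell_completeness(missing_genes, impact):
    """Return (mean completeness, per-task completeness) for one cell.

    impact maps gene -> {task: rel_change}; genes without an entry
    contribute nothing.
    """
    tasks = sorted({task for per_task in impact.values() for task in per_task})
    per_task = {}
    for task in tasks:
        total = sum(impact.get(gene, {}).get(task, 0.0) for gene in missing_genes)
        clipped = min(max(total, 0.0), 1.0)  # clip(sum of impacts, 0, 1)
        per_task[task] = 1.0 - clipped
    score = sum(per_task.values()) / len(per_task) if per_task else float("nan")
    return score, per_task


# Invented rel_change impacts of three genes on two tasks.
impact = {
    "G1": {"T1": 1.0, "T2": 0.25},
    "G2": {"T1": 0.5, "T2": 0.25},
    "G3": {"T1": 0.0, "T2": 0.5},
}

# Genes that are absent from the dataset or have zero expression in this cell.
score_a, tasks_a = cell_completeness({"G2", "G3"}, impact)
score_b, tasks_b = cell_completeness({"G1", "G2"}, impact)
```

In the second call the summed impact on T1 is 1.5, which is clipped to 1, so T1 is scored as fully incomplete and not given a negative value.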

sccellfie.reports.generate_completeness_report(adata, gpr_source, task_by_rxn, ablation_impact=None, reaction_impact=None, metric='fraction_zeroed', threshold=1.0, layer=None, write_to_obs=True, obs_key_prefix='completeness_', return_matrix=False, disable_pbar=True)[source]

Run both compute_dataset_completeness and compute_cell_completeness in one call and return {‘dataset’: …, ‘cell’: …}. Shares the same ablation impact across both sub-reports to avoid duplicate computation.

sccellfie.reports.generate_report_from_adata(adata, group_by, agg_func='trimean', layer=None, features=None, tissue_col=None, feature_name='feature', min_cells=1, threshold=np.float64(3.4657359027997265), default_tissue_name='tissue', **kwargs)[source]

Process AnnData object and calculate metrics for each group (e.g., cell type).

Parameters:
  • adata (AnnData) – AnnData object containing the expression data.

  • group_by (str) – Column name in adata.obs for the groups (e.g., cell types).

  • agg_func (str, optional (default: 'trimean')) – The aggregation function to apply. Options are ‘mean’, ‘median’, ‘25p’ (25th percentile), ‘75p’ (75th percentile), ‘trimean’ (0.5*Q2 + 0.25(Q1+Q3)), and ‘topmean’ (computed among the top `top_percent`% of values).

  • layer (str, optional (default: None)) – Name of the layer in adata to use. If None, uses adata.X.

  • features (list, optional (default: None)) – Names of features to analyze. If None, uses adata.var_names.

  • tissue_col (str, optional (default: None)) – Column name in adata.obs for tissue information.

  • feature_name (str, optional (default: 'feature')) – Name to use for features in melted results (e.g., ‘metabolic_task’, ‘reaction’).

  • min_cells (int, optional (default: 1)) – Minimum number of cells required for a group to be included in the analysis.

  • threshold (float, optional (default: 5*np.log(2))) – Threshold value for counting cells passing expression threshold.

  • default_tissue_name (str, optional (default: 'tissue')) – Default tissue name to use when tissue_col is not provided.

  • **kwargs (dict) – Additional arguments to pass to the aggregation function.

Returns:

Dictionary containing DataFrames for each metric:
  • agg_values: Aggregated values (e.g., trimean) per group

  • variance: Variance values per group

  • std: Standard deviation values per group

  • threshold_cells: Number of cells passing threshold per group

  • nonzero_cells: Number of non-zero cells per group

  • cell_counts: Number of cells per group

  • min_max: Min/max values for features

  • melted: Melted version of all metrics

Return type:

dict
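As a rough illustration of the 'trimean' aggregation and the default threshold described above, the following minimal sketch uses only the Python standard library. The helper name trimean, the toy scores, and the use of >= when counting cells against the threshold are assumptions made for illustration; they are not taken from the scCellFie source.

```python
import math
import statistics


def trimean(values):
    """Tukey's trimean: 0.5*Q2 + 0.25*(Q1 + Q3)."""
    # method="inclusive" interpolates linearly between observed data points
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return 0.5 * q2 + 0.25 * (q1 + q3)


# Hypothetical scores for one feature within one group of cells
scores = [0.0, 1.0, 2.0, 3.0, 4.0]

# Default threshold documented as 5*np.log(2)
default_threshold = 5 * math.log(2)

group_trimean = trimean(scores)
# Assumption: a cell "passes" when its value is >= the threshold
n_passing = sum(s >= default_threshold for s in scores)
```

For these toy values the quartiles are 1, 2 and 3, so the trimean is 2.0, and only the cell with score 4.0 clears the default threshold of about 3.47.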

Spatial

Stats

sccellfie.stats.compute_gene_ablation_impact(gpr_source, task_by_rxn, genes=None, uniform_score=1.0, disable_pbar=False)[source]

Simulate single-gene ablation on a synthetic uniform-expression reference and measure per-task impact.

For each gene, set its gene_score to 0 (leaving every other gene at uniform_score), re-evaluate every reaction whose GPR contains the gene, then recompute metabolic-task scores using the same arithmetic as sccellfie.metabolic_task.compute_mt_score.

Parameters:
  • gpr_source (dict) – Either {reaction_id: cobra.core.gene.GPR} (as returned by sccellfie.preprocessing.prepare_inputs.preprocess_inputs) or {reaction_id: str} of raw GPR strings (parsed internally via cobra.core.gene.GPR().from_string).

  • task_by_rxn (pandas.DataFrame) – Rows are metabolic tasks, columns are reactions. Cell (T, r) is non-zero iff reaction r participates in task T.

  • genes (list of str, optional (default: None)) – Subset of genes to ablate. Default uses the union of all genes across the GPRs. Genes not appearing in any GPR contribute an all-zero row.

  • uniform_score (float, optional (default: 1.0)) – Positive score assigned to every non-ablated gene. Exposed mainly for testing; rel_change and fraction_zeroed are scale-invariant, while abs_change scales linearly with this value.

  • disable_pbar (bool, optional (default: False)) – Disable the per-gene progress bar.

Returns:

Three (gene x task) DataFrames keyed by:
  • ’rel_change’ : (baseline_mts - ablated_mts) / baseline_mts, in [0, 1]. 1.0 means the gene fully zeros the task under the uniform reference.

  • ’abs_change’ : baseline_mts - ablated_mts.

  • ’fraction_zeroed’: 1 iff ablated_mts == 0 and baseline_mts > 0, else 0.

Return type:

dict[str, pandas.DataFrame]

Notes

Under a single-cell uniform reference, every reaction’s baseline RAL would equal 5*log(1 + uniform_score/uniform_score) = 5*log(2) if it were reached through compute_gene_scores. Here, however, the GPR walker is called directly on gene scores (not raw expression), so baseline RAL and baseline MTS both equal uniform_score exactly (the min/max of constant uniform_score values). This is a property of the walker, not of the gene_score transform.
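The three returned metrics can be sketched for a single (gene, task) pair as follows. This is a minimal illustration of the formulas listed under Returns, not the library's implementation; the function name ablation_metrics, the toy scores, and the handling of a zero baseline are assumptions.

```python
def ablation_metrics(baseline_mts, ablated_mts):
    """Compute the three impact metrics for one (gene, task) pair."""
    abs_change = baseline_mts - ablated_mts
    # Assumption: a task with zero baseline reports rel_change = 0.0
    rel_change = abs_change / baseline_mts if baseline_mts > 0 else 0.0
    fraction_zeroed = 1 if (ablated_mts == 0 and baseline_mts > 0) else 0
    return {
        "rel_change": rel_change,
        "abs_change": abs_change,
        "fraction_zeroed": fraction_zeroed,
    }


# Hypothetical task scores under uniform_score = 1.0 and uniform_score = 2.0
low = ablation_metrics(baseline_mts=1.0, ablated_mts=0.5)
high = ablation_metrics(baseline_mts=2.0, ablated_mts=1.0)

# A gene whose ablation drives the task score to zero
knocked_out = ablation_metrics(baseline_mts=1.0, ablated_mts=0.0)
```

Doubling the uniform score leaves rel_change unchanged (0.5 in both cases) while abs_change doubles from 0.5 to 1.0, which is the scale behavior described for uniform_score.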

sccellfie.stats.compute_reaction_topology_essentiality(task_by_rxn, cobra_model, task_endpoints, treat_reversible_as_bidirectional=True, ignore_metabolites=None)[source]

For each task with a user-supplied (start_metabolite, end_metabolite), flag reactions that are essential for connecting start -> end through the task’s metabolite graph.

The graph has metabolites as nodes and reactions as edges. For each reaction in the task that is present in the cobra Model, an edge is added from every substrate to every product. When treat_reversible_as_bidirectional is True, reversible reactions also contribute reverse edges. Optionally, metabolites in ignore_metabolites are excluded from the graph (useful for currency metabolites like ATP/ADP/H+/H2O).

A reaction is essential iff removing all of its edges disconnects start_met from end_met. Tasks without a specified endpoint pair are skipped (their column in the output is all False).

Parameters:
  • task_by_rxn (pandas.DataFrame) – Rows are tasks, columns are reactions, non-zero where the reaction participates in the task.

  • cobra_model (cobra.Model) – Genome-scale metabolic model whose reaction IDs and metabolite IDs match those used in task_by_rxn.

  • task_endpoints (dict[str, tuple[str, str]]) – {task_name: (start_metabolite_id, end_metabolite_id)}. Only tasks listed here are evaluated; others get an all-False column.

  • treat_reversible_as_bidirectional (bool, optional (default: True)) – If True, reactions with rxn.reversibility == True contribute edges in both directions. If False, edges follow the nominal substrate -> product direction only.

  • ignore_metabolites (set of str, optional (default: None)) – Metabolite IDs to exclude as graph nodes. Edges that would use any of them are not added.

Returns:

(reactions x tasks) boolean DataFrame. True at (r, T) iff removing reaction r disconnects the start -> end path in task T. Rows are indexed by all reaction IDs in task_by_rxn.columns.

Return type:

pandas.DataFrame
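The connectivity test described above can be sketched without cobra by representing each reaction as a tuple of substrates, products and a reversibility flag. All names below (reachable, build_edges, essential_reactions, the toy reactions R1 to R4) are hypothetical, and returning all False when start and end are not connected to begin with is an assumption.

```python
from collections import deque


def reachable(edges, start, end):
    """Breadth-first search over a directed metabolite graph."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        if node == end:
            return True
        for nxt in edges.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def build_edges(reactions, skip=None, ignore=frozenset(), bidirectional=True):
    """Add an edge from every substrate to every product of each reaction."""
    edges = {}
    for rxn_id, (substrates, products, reversible) in reactions.items():
        if rxn_id == skip:
            continue  # simulate removing this reaction
        pairs = [(s, p) for s in substrates for p in products]
        if reversible and bidirectional:
            pairs += [(p, s) for s, p in pairs]
        for a, b in pairs:
            if a in ignore or b in ignore:
                continue  # e.g. currency metabolites
            edges.setdefault(a, set()).add(b)
    return edges


def essential_reactions(reactions, start, end, ignore=frozenset(), bidirectional=True):
    """Flag reactions whose removal disconnects start from end."""
    full = build_edges(reactions, ignore=ignore, bidirectional=bidirectional)
    if not reachable(full, start, end):
        # Assumption: nothing is essential if no path exists at all
        return {rxn_id: False for rxn_id in reactions}
    return {
        rxn_id: not reachable(
            build_edges(reactions, skip=rxn_id, ignore=ignore, bidirectional=bidirectional),
            start,
            end,
        )
        for rxn_id in reactions
    }


# Toy task: A -> B -> C -> D, with two alternative reactions for B -> C
toy_reactions = {
    "R1": (["A"], ["B"], False),
    "R2": (["B"], ["C"], False),
    "R3": (["B"], ["C"], False),
    "R4": (["C"], ["D"], False),
}
essential = essential_reactions(toy_reactions, start="A", end="D")
```

In this toy task R1 and R4 are essential, while R2 and R3 are not because each can substitute for the other.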

sccellfie.stats.essential_genes_from_ablation(impact, metric='fraction_zeroed', threshold=1.0, topology=None, task_by_rxn=None, gpr_source=None, fallback_to_ablation_only=True)[source]

Derive per-task essential-gene lists from the ablation impact output, optionally filtered by a reaction-level topology essentiality DataFrame.

Parameters:
  • impact (dict[str, pandas.DataFrame] or pandas.DataFrame) – Output from compute_gene_ablation_impact, or one of its DataFrames.

  • metric (str, optional (default: 'fraction_zeroed')) – Which impact DataFrame to threshold when impact is a dict. Must be one of ‘rel_change’, ‘abs_change’, ‘fraction_zeroed’.

  • threshold (float, optional (default: 1.0)) – A gene is flagged essential for task T iff impact[metric].loc[g, T] >= threshold.

  • topology (pandas.DataFrame, optional (default: None)) – (reactions x tasks) boolean DataFrame from compute_reaction_topology_essentiality. When provided, a gene is essential only if, in addition to clearing the threshold, at least one of the reactions it appears in (for that task) is marked essential by the topology.

  • task_by_rxn (pandas.DataFrame, optional) – Required when topology is provided. Defines each task’s reaction membership.

  • gpr_source (dict, optional) – Required when topology is provided. Same format as compute_gene_ablation_impact (GPR objects or strings). Used to map genes to reactions.

  • fallback_to_ablation_only (bool, optional (default: True)) – When topology is provided but a given task’s column is all-False (not evaluated, no endpoints, or missing in the model): if True, fall back to the plain ablation threshold for that task. If False, yield [] for that task.

Returns:

{task_name: sorted list of essential genes}.

Return type:

dict[str, list[str]]
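The plain thresholding step (without the topology filter) can be sketched as below. To keep the example free of third-party dependencies, a nested dict {gene: {task: value}} stands in for the (gene x task) DataFrame; the function name and the toy values are hypothetical.

```python
def essential_genes_sketch(impact, threshold=1.0):
    """Return {task: sorted genes whose impact is >= threshold}."""
    tasks = sorted({task for row in impact.values() for task in row})
    return {
        task: sorted(
            gene for gene, row in impact.items() if row.get(task, 0.0) >= threshold
        )
        for task in tasks
    }


# Hypothetical 'fraction_zeroed' values for three genes and two tasks
fraction_zeroed = {
    "G1": {"T1": 1, "T2": 0},
    "G2": {"T1": 0, "T2": 0},
    "G3": {"T1": 1, "T2": 1},
}
essential_by_task = essential_genes_sketch(fraction_zeroed, threshold=1.0)
```

With the default metric and threshold, a gene is listed for a task only when its ablation zeroes that task, so here T1 maps to G1 and G3 and T2 maps to G3.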

sccellfie.stats.cohens_d(group1, group2)[source]

Calculates Cohen’s d effect size for two groups.

Parameters:
  • group1 (array-like) – Values from the first group of samples.

  • group2 (array-like) – Values from the second group of samples.

Returns:

d – Cohen’s d effect size.

Return type:

float
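For reference, the textbook pooled-standard-deviation form of Cohen's d can be sketched as follows. Whether sccellfie.stats.cohens_d uses exactly this variant (sample variances, group1 minus group2) is an assumption; the function name and toy data are illustrative.

```python
import math
import statistics


def cohens_d_sketch(group1, group2):
    """Cohen's d using the pooled sample standard deviation."""
    n1, n2 = len(group1), len(group2)
    var1 = statistics.variance(group1)  # sample variance (n - 1 denominator)
    var2 = statistics.variance(group2)
    pooled_sd = math.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2))
    # Assumption: the sign follows mean(group1) - mean(group2)
    return (statistics.mean(group1) - statistics.mean(group2)) / pooled_sd


d = cohens_d_sketch([2, 4, 6], [1, 3, 5])
```

Both toy groups have a sample variance of 4, so the pooled standard deviation is 2 and the mean difference of 1 gives d = 0.5.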

sccellfie.stats.scanpy_differential_analysis(adata, cell_type, cell_type_key, condition_key, condition_pairs=None, var_names=None, alpha=0.05, min_cells=30, downsample=False, n_iterations=50, agg_method='mean', random_state=None)[source]

Performs differential expression analysis using Scanpy’s rank_genes_groups function.

Parameters:
  • adata (AnnData object) – Annotated data matrix containing the expression data.

  • cell_type (str or None) – The cell type to analyze. If None, analysis is performed for all cell types.

  • cell_type_key (str) – The column name in adata.obs containing the cell type information.

  • condition_key (str) – The column name in adata.obs containing the condition information.

  • condition_pairs (list of tuples, optional (default: None)) – The pairs of conditions to compare. If None, all pairs of conditions are compared.

  • var_names (list of str, optional (default: None)) – The list of variable names (e.g. genes) to perform the differential expression analysis on. If None, all genes are used.

  • alpha (float, optional (default: 0.05)) – The significance level for the multiple testing correction.

  • min_cells (int, optional (default: 30)) – Minimum number of cells required in each group for comparison.

  • downsample (bool, optional (default: False)) – Whether to perform downsampling to balance group sizes.

  • n_iterations (int, optional (default: 50)) – Number of subsampling iterations if downsample=True.

  • agg_method (str, optional (default: 'mean')) – Method to aggregate results across iterations (‘mean’ or ‘median’).

  • random_state (int, optional (default: None)) – Random seed for reproducibility of downsampling.

Returns:

df_results – A DataFrame containing the results of the differential expression analysis with columns:
  • cell_type: The analyzed cell type

  • feature: Name of the analyzed feature

  • group1: First condition in the comparison

  • group2: Second condition in the comparison

  • log2FC: Log2 fold change between means of conditions

  • test_statistic: Wilcoxon test statistic

  • p_value: Raw p-value

  • adj_p_value: BH-corrected p-value

  • cohens_d: Effect size (Cohen’s d)

  • n_group1: Number of observations in group1

  • n_group2: Number of observations in group2

  • median_group1: Median expression in group1

  • median_group2: Median expression in group2

  • median_diff: Difference in medians (group2 - group1)

Return type:

pandas.DataFrame
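The adj_p_value column is described as BH-corrected. A minimal pure-Python sketch of the Benjamini-Hochberg adjustment is shown below to make the relationship between p_value, adj_p_value and alpha concrete. The function name and p-values are hypothetical, and which set of tests the library corrects over (per comparison, per cell type, or overall) is not specified here.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for four features
raw_p = [0.01, 0.04, 0.03, 0.5]
adj_p = benjamini_hochberg(raw_p)

alpha = 0.05
significant = [p <= alpha for p in adj_p]
```

Here three raw p-values fall below 0.05, but only the first feature remains significant after adjustment.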

sccellfie.stats.pairwise_differential_analysis(adata, groupby, var_names=None, order=None, alternative='two-sided', alpha=0.05)[source]

Performs pairwise Wilcoxon tests for each feature between all group pairs. This function does not perform the test in a cell type-wise manner. For that, use scanpy_differential_analysis.

Parameters:
  • adata (AnnData) – AnnData object containing the expression data.

  • groupby (str) – Column in adata.obs containing group labels.

  • var_names (list, optional (default: None)) – List of feature names to test. If None, all features are tested.

  • order (list, optional (default: None)) – Specific order of groups to test. If None, groups are sorted.

  • alternative (str, optional (default: 'two-sided')) – Alternative hypothesis for the Wilcoxon rank-sum test. Options are ‘two-sided’, ‘greater’, ‘less’.

  • alpha (float, optional (default: 0.05)) – Significance level for multiple testing correction.

Returns:

df – A DataFrame containing the results with the same columns as scanpy_differential_analysis (except ‘cell_type’) for consistency:

  • feature: Name of the analyzed feature

  • group1: First condition in the comparison

  • group2: Second condition in the comparison

  • log2FC: Log2 fold change between conditions

  • test_statistic: Wilcoxon test statistic

  • p_value: Raw p-value

  • adj_p_value: BH-corrected p-value

  • cohens_d: Effect size (Cohen’s d)

  • n_group1: Number of observations in group1

  • n_group2: Number of observations in group2

  • median_group1: Median expression in group1

  • median_group2: Median expression in group2

  • median_diff: Difference in medians (group2 - group1)

Return type:

pandas.DataFrame
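The way group pairs are enumerated and the sign convention of median_diff (group2 - group1) can be illustrated with a short sketch. The function name, the dict-of-lists input, and the toy groups are assumptions; the statistical test itself is omitted.

```python
from itertools import combinations
import statistics


def pairwise_median_diff(values_by_group, order=None):
    """Build one row per group pair for a single feature."""
    # If no order is given, groups are sorted, as described for `order`
    groups = list(order) if order is not None else sorted(values_by_group)
    rows = []
    for g1, g2 in combinations(groups, 2):
        m1 = statistics.median(values_by_group[g1])
        m2 = statistics.median(values_by_group[g2])
        rows.append(
            {
                "group1": g1,
                "group2": g2,
                "median_group1": m1,
                "median_group2": m2,
                "median_diff": m2 - m1,  # group2 - group1
            }
        )
    return rows


# Hypothetical values of one feature in three groups
toy_values = {"treatB": [5, 6, 7], "ctrl": [1, 2, 3], "treatA": [2, 3, 4]}
rows = pairwise_median_diff(toy_values)
```

Three groups yield three pairs; the first row compares ctrl with treatA and has a median_diff of 1.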

sccellfie.stats.generate_pseudobulks(adata, cell_type_key, n_pseudobulks=5, cells_per_bulk=1000, layer=None, use_raw=False, genes=None, agg_func='trimean', continuous_key=None, random_seed=None)[source]

Generates pseudo-bulk samples from single-cell data. Each pseudo-bulk represents a group of cells from the same cell type.

Parameters:
  • adata (AnnData) – An AnnData object containing the single-cell expression data.

  • cell_type_key (str) – The key in adata.obs that contains the cell type annotations.

  • n_pseudobulks (int, optional (default: 5)) – The number of pseudo-bulk samples to generate for each cell type. Fewer will be generated if the cell type has fewer than n_pseudobulks * cells_per_bulk cells.

  • cells_per_bulk (int, optional (default: 1000)) – The number of cells to include in each pseudo-bulk sample. Fewer will be used if there are fewer cells in the cell type.

  • layer (str, optional (default: None)) – The name of the layer in adata to use for aggregation. If None, the main expression matrix adata.X is used.

  • use_raw (bool, optional (default: False)) – Whether to use the data in adata.raw.X (True) or in adata.X (False).

  • genes (list, optional (default: None)) – List of gene names to include in the pseudo-bulk samples. If None, all genes are included.

  • agg_func (str, optional (default: 'trimean')) – The aggregation function to apply. Options are ‘mean’, ‘median’, ‘25p’ (25th percentile), ‘75p’ (75th percentile), ‘trimean’ (0.5*Q2 + 0.25(Q1+Q3)), and ‘topmean’ (computed among the top `top_percent`% of values).

  • continuous_key (str, optional (default: None)) – The key in adata.obs that contains continuous values to include in the pseudo-bulk samples. If None, continuous values are not included. This is useful for trajectory analysis or other continuous annotations.

  • random_seed (int, optional (default: None)) – Random seed for reproducible pseudo-bulk generation.

Returns:

adata_pseudobulk – An AnnData object containing the pseudo-bulk samples. The expression values are aggregated across the cells in each pseudo-bulk. The obs DataFrame contains the cell type annotations and the continuous values if provided.

Return type:

AnnData
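One way to picture how cells might be assigned to pseudo-bulks is the sketch below, which shuffles the cells of each type and cuts them into disjoint chunks. Sampling without replacement, the chunking rule, and every name here are assumptions for illustration only; the library's actual sampling scheme may differ.

```python
import random


def assign_pseudobulks(cell_types, n_pseudobulks=5, cells_per_bulk=1000, random_seed=None):
    """Map (cell_type, bulk_index) to a list of cell indices."""
    rng = random.Random(random_seed)
    by_type = {}
    for idx, cell_type in enumerate(cell_types):
        by_type.setdefault(cell_type, []).append(idx)

    bulks = {}
    for cell_type, idxs in by_type.items():
        rng.shuffle(idxs)
        # Use fewer cells per bulk if the type is small
        size = min(cells_per_bulk, len(idxs))
        # Generate fewer bulks if there are not enough cells
        n_bulks = min(n_pseudobulks, len(idxs) // size)
        for b in range(n_bulks):
            bulks[(cell_type, b)] = idxs[b * size:(b + 1) * size]
    return bulks


# Hypothetical annotations: 7 T cells and 3 B cells
cell_types = ["T"] * 7 + ["B"] * 3
bulks = assign_pseudobulks(cell_types, n_pseudobulks=2, cells_per_bulk=3, random_seed=0)
```

With these settings the T cells yield two pseudo-bulks of three cells each, while the B cells only have enough cells for one.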

sccellfie.stats.fit_gam_model(adata, cell_type_key, cell_type_order=None, continuous_key=None, genes=None, layer=None, use_raw=False, n_splines=10, spline_order=3, lam=0.6, normalize=False, use_pseudobulk=False, n_pseudobulks=5, cells_per_bulk=1000, pseudobulk_agg='trimean', **kwargs)[source]

Fits Generalized Additive Models (GAMs) to single-cell data for each gene.

Parameters:
  • adata (AnnData) – An AnnData object containing the single-cell expression data.

  • cell_type_key (str) – The key in adata.obs that contains the cell type annotations.

  • cell_type_order (list, optional (default: None)) – The order in which to process cell types. If None, cell types are processed in alphabetical order. This is useful when you have a known biological order for the cell types.

  • continuous_key (str, optional (default: None)) – The key in adata.obs that contains continuous values to include in the GAM models. If None, continuous values are not included. This is useful for trajectory analysis or other continuous annotations.

  • genes (list, optional (default: None)) – List of gene names to include in the GAM models. If None, all genes are included.

  • layer (str, optional (default: None)) – The name of the layer in adata to use for aggregation. If None, the main expression matrix adata.X is used.

  • use_raw (bool, optional (default: False)) – Whether to use the data in adata.raw.X (True) or in adata.X (False).

  • n_splines (int, optional (default: 10)) – Number of splines to use for the feature function in the GAM. Must be non-negative.

  • spline_order (int, optional (default: 3)) – Order of spline to use for the feature function in the GAM. Must be non-negative.

  • lam (float, optional (default: 0.6)) – Strength of smoothing penalty in the GAM. Must be a positive float. Larger values enforce stronger smoothing.

  • normalize (bool, optional (default: False)) – Whether to normalize the expression values for each gene. This normalization is of the type min-max scaling, where the minimum and maximum values are 0 and 1.

  • use_pseudobulk (bool, optional (default: False)) – Whether to use pseudobulk samples for the GAM analysis. If True, the GAM models are fitted to the aggregated expression values for each cell type. This is useful for reducing the bias in statistical power that comes from having many single cells.

  • n_pseudobulks (int, optional (default: 5)) – The number of pseudo-bulk samples to generate for each cell type.

  • cells_per_bulk (int, optional (default: 1000)) – The number of cells to include in each pseudo-bulk sample.

  • pseudobulk_agg (str, optional (default: 'trimean')) – The aggregation function to apply when generating the pseudo-bulk samples. Options are ‘mean’, ‘median’, ‘25p’ (25th percentile), ‘75p’ (75th percentile), ‘trimean’ (0.5*Q2 + 0.25(Q1+Q3)), and ‘topmean’ (computed among the top `top_percent`% of values).

  • kwargs (dict, optional) – Additional keyword arguments to pass to the GAM model. You can find more about it in the pygam documentation: https://pygam.readthedocs.io/en/latest/api/gam.html.

Returns:

result – A dictionary containing the fitted GAM models, the model scores, and additional information about the pseudo-bulk assignments and cell type encoder when applicable.

Return type:

dict
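The min-max scaling applied when normalize=True can be sketched for a single gene as follows. The function name and the handling of a constant gene (all zeros) are assumptions.

```python
def min_max_scale(values):
    """Rescale values so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant gene is mapped to all zeros
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical expression values of one gene across samples
scaled = min_max_scale([2.0, 4.0, 6.0])
```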

sccellfie.stats.analyze_gam_results(gam_results, significance_threshold=0.05, fdr_level=0.05)[source]

Analyzes GAM model results with FDR correction using statsmodels.

Parameters:
  • gam_results (dict) – A dictionary containing the results of the GAM analysis. It should contain the ‘scores’ key with a DataFrame of model scores for each gene.

  • significance_threshold (float, optional (default: 0.05)) – The significance threshold to consider a gene as significant.

  • fdr_level (float, optional (default: 0.05)) – The False Discovery Rate (FDR) level to correct for multiple testing.

Returns:

results_df – A DataFrame containing the model scores for each gene, along with the adjusted p-values and significance based on the significance threshold and FDR level.

Return type:

pandas.DataFrame

sccellfie.stats.get_task_determinant_genes(adata, metabolic_task, task_by_rxn, groupby=None, group=None, min_activity=0.0)[source]

Finds the genes that determine the activity of all reactions in a metabolic task. Returns determinant genes for each reaction and their activity across specified cell groups, along with the fraction of cells in each group where the gene was determinant.

Parameters:
  • adata (AnnData object) – Annotated data matrix.

  • metabolic_task (str) – Name of the metabolic task to analyze. Must be one of the tasks in the task_by_rxn DataFrame. It must also be present in the adata.metabolic_tasks attribute.

  • task_by_rxn (pandas.DataFrame) – A pandas.DataFrame object where rows are metabolic tasks and columns are reactions. Each cell contains ones or zeros, indicating whether a reaction is involved in a metabolic task.

  • groupby (str, optional (default: None)) – The key in the adata.obs DataFrame to group by. This could be any categorical annotation of cells (e.g., cell type, cluster).

  • group (str or list, optional (default: None)) – The group(s) in the adata.obs DataFrame to analyze. If None, the analysis is performed by treating all single cells as a group. If group is specified, groupby must be specified. The column referred to by groupby must contain the groups specified in group.

  • min_activity (float, optional (default: 0.0)) – Minimum reaction activity level to consider a reaction as active. Only genes that are associated with active reactions are considered. If zero, all reactions and therefore all genes are considered.

Returns:

df – A pandas.DataFrame reporting the determinant genes for each reaction in the metabolic task. The DataFrame has the following columns:

  • Group: The cell group.

  • Rxn: The reaction.

  • Det-Gene: The determinant gene for the reaction.

  • RAL: The reaction activity level for the reaction.

  • Cell_fraction: The fraction of cells in the group where this gene was determinant.

Return type:

pandas.DataFrame

Notes

This function assumes that reaction activity levels have been computed using sccellfie.reaction_activity.compute_reaction_activity() and are stored in adata.reactions.X.

Scores are computed as previously indicated in the CellFie paper (https://doi.org/10.1016/j.crmeth.2021.100040).
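The idea of a determinant gene and its Cell_fraction can be illustrated with a toy GPR evaluator. The sketch assumes the CellFie-style convention that AND takes the minimum and OR the maximum of its operands; the nested-tuple rule format, the function name, and the gene scores are hypothetical and do not reflect how scCellFie stores GPRs.

```python
from collections import Counter


def evaluate_gpr(rule, gene_scores):
    """Return (score, determinant_gene) for a nested GPR rule.

    A rule is either a gene ID string or a tuple ("and" | "or", [sub-rules]).
    """
    if isinstance(rule, str):
        return gene_scores[rule], rule
    operator, operands = rule
    results = [evaluate_gpr(sub_rule, gene_scores) for sub_rule in operands]
    # Assumption: AND is limited by the lowest score, OR by the highest
    pick = min if operator == "and" else max
    return pick(results, key=lambda pair: pair[0])


# Hypothetical rule: (G1 and G2) or G3
rule = ("or", [("and", ["G1", "G2"]), "G3"])

# Hypothetical gene scores in three cells
cells = [
    {"G1": 4.0, "G2": 1.0, "G3": 2.0},
    {"G1": 5.0, "G2": 6.0, "G3": 2.0},
    {"G1": 1.0, "G2": 1.5, "G3": 3.0},
]

per_cell = [evaluate_gpr(rule, scores) for scores in cells]
counts = Counter(gene for _, gene in per_cell)
cell_fraction = {gene: n / len(cells) for gene, n in counts.items()}
```

In this toy group G3 determines the reaction activity in two of the three cells and G1 in the remaining one.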