API Reference
This section provides detailed documentation for all modules and functions in scCellFie.
scCellFie
Communication
- sccellfie.communication.compute_local_colocalization_scores(adata, var1, var2, neighbors_radius, method='pairwise_concordance', spatial_key='X_spatial', min_neighbors=3, threshold1=None, threshold2=None, score_key=None, inplace=True)
Computes local colocalization scores between two variables for each spatial spot.
- Parameters:
adata (AnnData) – AnnData object containing expression data and spatial coordinates.
var1 (str) – Name of first variable to analyze.
var2 (str) – Name of second variable to analyze.
neighbors_radius (float) – Radius defining the neighborhood of a spot (the spot is the center, and all neighbors within this radius are considered).
method (str, optional (default: 'pairwise_concordance')) – Method to compute colocalization:
‘correlation’: Local Pearson correlation between var1 and var2 across spot & neighbors.
‘concordance’: Compute the fraction of spots where both genes are expressed above their thresholds.
‘pairwise_concordance’: Compute the fraction of spot pairs in the neighborhood where var1 and var2 are expressed above their thresholds in spot 1 and spot 2, respectively.
‘cosine’: Local cosine similarity between var1 and var2 across spot & neighbors.
‘weighted_gmean’: Local weighted geometric mean across spot & neighbors (weighted by distance).
‘regularized_weighted_gmean’: Local regularized and weighted geometric mean across spot & neighbors (weighted by distance).
spatial_key (str, optional (default: 'X_spatial')) – Key in adata.obsm containing spatial coordinates.
min_neighbors (int, optional (default: 3)) – Minimum number of neighbors required for computing the score. If fewer neighbors are found, the score is NaN.
threshold1 (float, optional (default: None)) – Threshold for var1. If None, the mean of var1 is used.
threshold2 (float, optional (default: None)) – Threshold for var2. If None, the mean of var2 is used.
score_key (str, optional (default: None)) – Key to store the computed colocalization scores in adata.obs. If None, a default key is used.
inplace (bool, optional (default: True)) – If True, the computed scores are added to adata.obs. Otherwise, the scores are returned as a numpy array.
- Returns:
Array of colocalization scores for each spot
- Return type:
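The ‘concordance’ option described above can be illustrated with a minimal standard-library sketch. This is not scCellFie's implementation: the function name local_concordance, the plain-list inputs, the inclusive radius boundary, and the strict "greater than" comparison are all assumptions made for illustration.

```python
import math


def local_concordance(coords, var1, var2, radius, min_neighbors=3,
                      threshold1=None, threshold2=None):
    """Illustrative 'concordance' score per spot; not scCellFie's own code."""
    n = len(coords)
    # Default thresholds fall back to the mean of each variable.
    t1 = sum(var1) / n if threshold1 is None else threshold1
    t2 = sum(var2) / n if threshold2 is None else threshold2
    scores = []
    for i in range(n):
        # Neighbors: every other spot within `radius` of the center spot i
        # (assumption: the boundary is inclusive).
        neighbors = [j for j in range(n)
                     if j != i and math.dist(coords[i], coords[j]) <= radius]
        if len(neighbors) < min_neighbors:
            scores.append(float("nan"))  # too few neighbors
            continue
        members = [i] + neighbors  # the spot itself plus its neighbors
        both = sum(1 for j in members if var1[j] > t1 and var2[j] > t2)
        scores.append(both / len(members))
    return scores


coords = [(0, 0), (1, 0), (0, 1), (1, 1), (10, 10)]
var1 = [5.0, 4.0, 0.0, 6.0, 0.0]
var2 = [3.0, 0.0, 0.0, 4.0, 0.0]
scores = local_concordance(coords, var1, var2, radius=1.5)
```

In this toy layout the four clustered spots share one neighborhood, while the isolated spot has no neighbors and therefore receives NaN.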
- sccellfie.communication.compute_communication_scores(adata, groupby, var_pairs, communication_score='gmean', agg_func='mean', layer=None, ligand_threshold=0, receptor_threshold=0)
Computes communication scores between pairs of features or variables (normally representing ligand-receptor pairs) across different cell types.
- Parameters:
adata (AnnData) – AnnData object containing expression data and grouping information
groupby (str) – Column in adata.obs for grouping cells to aggregate expression.
var_pairs (list of tuples) – List of (var1, var2) pairs (normally representing ligand-receptor pairs).
communication_score (str, default='gmean') –
- Method to compute communication scores. Options are:
’gmean’: geometric mean (sqrt(x * y))
’product’: simple multiplication (x * y)
’mean’: arithmetic mean ((x + y) / 2)
agg_func (str, default='mean') – Aggregation function for aggregating expression values across cells. Options are ‘mean’, ‘median’, ‘25p’ (25th percentile), ‘75p’ (75th percentile), ‘trimean’ (0.5*Q2 + 0.25(Q1+Q3)), and ‘topmean’.
layer (str, optional) – Layer in adata to use for aggregation. If None, the main expression matrix adata.X is used.
ligand_threshold (float, default=0) – Threshold for calculating the fraction of cells expressing the ligand. Only cells with expression above this threshold are considered as expressing the ligand.
receptor_threshold (float, default=0) – Threshold for calculating the fraction of cells expressing the receptor. Only cells with expression above this threshold are considered as expressing the receptor.
- Returns:
ccc_scores – DataFrame containing the communication scores between cell types for each variable pair. Columns are:
sender_celltype: type of the sender cell
receiver_celltype: type of the receiver cell
ligand: name of the ligand
receptor: name of the receptor
score: communication score
ligand_fraction: fraction of sender cells expressing the ligand
receptor_fraction: fraction of receiver cells expressing the receptor
- Return type:
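The three communication_score options listed above reduce to simple arithmetic on the aggregated ligand and receptor values. A minimal sketch follows; the helper name communication_score and the toy aggregated values are hypothetical, and the real function operates on an AnnData object rather than on plain dictionaries.

```python
import math


def communication_score(ligand_expr, receptor_expr, method="gmean"):
    """Combine aggregated ligand and receptor expression into one score."""
    if method == "gmean":
        return math.sqrt(ligand_expr * receptor_expr)  # sqrt(x * y)
    if method == "product":
        return ligand_expr * receptor_expr  # x * y
    if method == "mean":
        return (ligand_expr + receptor_expr) / 2  # (x + y) / 2
    raise ValueError(f"Unknown method: {method}")


# Toy aggregated expression per cell type (hypothetical values).
agg = {"T cell": {"LIG": 4.0, "REC": 1.0}, "B cell": {"LIG": 0.0, "REC": 9.0}}

# One row per (sender, receiver) combination for the pair (LIG, REC).
rows = [
    {"sender_celltype": s, "receiver_celltype": r,
     "ligand": "LIG", "receptor": "REC",
     "score": communication_score(agg[s]["LIG"], agg[r]["REC"])}
    for s in agg for r in agg
]
```

Note that the geometric mean and the product are zero whenever either partner is not expressed, whereas the arithmetic mean is not.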
Datasets
- sccellfie.datasets.retrieve_ensembl2symbol_data(filename=None, organism='human')
Retrieves a dictionary mapping Ensembl IDs to gene symbols for a given organism.
- Parameters:
- Returns:
ensembl2symbol – A dictionary mapping Ensembl IDs to gene symbols
- Return type:
- sccellfie.datasets.load_sccellfie_database(organism='human', task_folder=None, rxn_info_filename=None, task_info_filename=None, task_by_rxn_filename=None, task_by_gene_filename=None, rxn_by_gene_filename=None, thresholds_filename=None)
Loads files of the metabolic task database from either a local folder, individual file paths, or predefined URLs.
- Parameters:
organism (str, optional (default: 'human')) – The organism to retrieve data for. Choose ‘human’ or ‘mouse’. Used when loading from URLs.
task_folder (str, optional (default: None)) – The local folder path containing CellFie data files. If provided, this takes priority.
rxn_info_filename (str, optional (default: None)) – Full path for reaction information JSON file.
task_info_filename (str, optional (default: None)) – Full path for task information CSV file.
task_by_rxn_filename (str, optional (default: None)) – Full path for task by reaction CSV file.
task_by_gene_filename (str, optional (default: None)) – Full path for task by gene CSV file.
rxn_by_gene_filename (str, optional (default: None)) – Full path for reaction by gene CSV file.
thresholds_filename (str, optional (default: None)) – Full path for thresholds CSV file.
- Returns:
data – A dictionary containing the loaded data frames and information. Keys are ‘rxn_info’, ‘task_info’, ‘task_by_rxn’, ‘task_by_gene’, ‘rxn_by_gene’, ‘thresholds’, and ‘organism’. Examples of dataframes can be found at https://github.com/earmingol/scCellFie/raw/refs/heads/main/task_data/homo_sapiens/
- Return type:
Expression
- sccellfie.expression.agg_expression_cells(adata, groupby, layer=None, gene_symbols=None, agg_func='mean', top_percent=10, exclude_zeros=False, use_raw=False, threshold=None)
Aggregates gene expression data for specified cell groups in an AnnData object.
- Parameters:
adata (AnnData) – An AnnData object containing the expression data to be aggregated.
groupby (str) – The key in the adata.obs DataFrame to group by. This could be any categorical annotation of cells (e.g., cell type, condition).
layer (str, optional (default: None)) – The name of the layer in adata to use for aggregation. If None, the main expression matrix adata.X is used.
gene_symbols (str or list, optional (default: None)) – Gene names to include in the aggregation. If a string is provided, it is converted to a single-element list. If None, all genes are included.
agg_func (str, optional (default: 'mean')) – The aggregation function to apply. Options are ‘mean’, ‘median’, ‘25p’ (25th percentile), ‘75p’ (75th percentile), ‘trimean’ (0.5*Q2 + 0.25(Q1+Q3)), ‘topmean’ (computed among the top top_percent% of values), and ‘fraction_above’ (fraction of cells above threshold). The function must be one of the keys in the AGG_FUNC dictionary.
top_percent (float, optional (default: 10)) – The percentage of top values to consider when agg_func is ‘topmean’. Ranging from 0 to 100.
exclude_zeros (bool, optional (default: False)) – Whether to exclude zeros when aggregating the values.
use_raw (bool, optional (default: False)) – Whether to use the data in adata.raw.X (True) or in adata.X (False).
threshold (float, optional (default: None)) – Expression threshold used when agg_func is ‘fraction_above’. Represents the minimum expression value for a cell to be considered as expressing the gene.
- Returns:
agg_expression – A pandas.DataFrame where columns correspond to genes and rows correspond to the unique categories in groupby. Each cell in the DataFrame contains the aggregated expression value for the corresponding gene and group.
- Return type:
- Raises:
AssertionError – If the provided agg_func is not a valid key in AGG_FUNC.
Notes
This function is used to compute summary statistics of gene expression data across different groups of cells. It is useful for exploring expression patterns in different cell types or conditions.
The function relies on the groupby parameter in adata.obs to define the groups of cells for which the expression data will be aggregated.
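As an illustration of group-wise aggregation with the ‘trimean’ option, the sketch below computes 0.5*Q2 + 0.25(Q1+Q3) per group using only the standard library. The helper names and the quartile interpolation convention are assumptions; scCellFie's own quantile computation may differ slightly.

```python
import statistics


def trimean(values):
    # Quartiles with linear interpolation between data points (an assumption;
    # other quantile conventions give slightly different results).
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return 0.5 * q2 + 0.25 * (q1 + q3)


def aggregate_by_group(values, groups, func):
    """Apply `func` to the values of each group; returns {group: aggregate}."""
    buckets = {}
    for value, group in zip(values, groups):
        buckets.setdefault(group, []).append(value)
    return {group: func(vals) for group, vals in buckets.items()}


# One gene measured in ten cells, five per cell type (toy values).
expression = [1.0, 2.0, 3.0, 4.0, 100.0, 0.0, 0.0, 2.0, 4.0, 4.0]
cell_types = ["T"] * 5 + ["B"] * 5
result = aggregate_by_group(expression, cell_types, trimean)
```

The outlier of 100.0 in the first group does not move the trimean, which is the usual motivation for preferring it over the mean.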
- sccellfie.expression.top_mean(x, axis, percent=10)
Computes the mean of the top percent% of values along the specified axis of a matrix, handling NaN values.
- Parameters:
x (numpy.ndarray) – The input matrix containing the data to be aggregated.
axis (int) – The axis along which to compute the mean. Use 0 for columns, 1 for rows.
percent (float, (default: 10)) – The percentage of top values to consider, ranging from 0 to 100. For example, 10 would compute the mean of the top 10% of values.
- Returns:
An array containing the mean of the top percent% of values for each row or column, depending on the specified axis. The shape of the output array will be (n_rows,) if axis=1, or (n_columns,) if axis=0.
- Return type:
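A minimal sketch of a "top mean" on nested lists is shown below. How many values count as the top percent is an assumption here (the ceiling of n * percent / 100, with at least one value), as is the choice to drop NaN values before ranking; the function names are hypothetical.

```python
import math


def top_mean_1d(values, percent=10):
    """Mean of the largest `percent`% of the non-NaN values."""
    clean = [v for v in values if not math.isnan(v)]
    if not clean:
        return float("nan")
    # Assumption: keep ceil(n * percent / 100) values, and at least one.
    k = max(1, math.ceil(len(clean) * percent / 100))
    top = sorted(clean, reverse=True)[:k]
    return sum(top) / k


def top_mean_sketch(matrix, axis, percent=10):
    """axis=1 aggregates each row; axis=0 aggregates each column."""
    lines = matrix if axis == 1 else [list(col) for col in zip(*matrix)]
    return [top_mean_1d(line, percent) for line in lines]


values = [float(v) for v in range(1, 21)]  # 1.0 .. 20.0
top10 = top_mean_1d(values, percent=10)    # mean of 20.0 and 19.0
by_column = top_mean_sketch([[1.0, 2.0], [3.0, 4.0]], axis=0, percent=50)
by_row = top_mean_sketch([[1.0, 2.0], [3.0, 4.0]], axis=1, percent=50)
```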
- sccellfie.expression.fraction_above_threshold(x, axis, threshold=0)
Computes the fraction of values above a threshold along the specified axis.
- Parameters:
x (numpy.ndarray) – The input matrix containing the data.
axis (int) – The axis along which to compute the fraction. Use 0 for columns, 1 for rows.
threshold (float, (default: 0)) – The threshold value above which to count values.
- Returns:
An array containing the fraction (between 0 and 1) of values above threshold.
- Return type:
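The axis convention can be sketched on a small nested list. The function name fraction_above and the strict "greater than" comparison are assumptions made for illustration.

```python
def fraction_above(matrix, axis, threshold=0):
    """Fraction of values strictly above `threshold` per row or column."""
    # axis=1 walks over rows; axis=0 walks over columns.
    lines = matrix if axis == 1 else [list(col) for col in zip(*matrix)]
    return [sum(1 for v in line if v > threshold) / len(line) for line in lines]


matrix = [[0, 1, 2],
          [0, 0, 3]]
per_column = fraction_above(matrix, axis=0)  # one value per column
per_row = fraction_above(matrix, axis=1)     # one value per row
```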
- sccellfie.expression.smooth_expression_knn(adata, key_added='smoothed_X', neighbors_key='neighbors', mode='connectivity', alpha=0.33, n_chunks=None, chunk_size=None, use_raw=False, disable_pbar=False)
Smooths expression values based on KNNs of single cells using Scanpy.
- Parameters:
adata (AnnData object) – Annotated data matrix containing the expression data and nearest neighbor graph.
key_added (str, optional (default: 'smoothed_X')) – The key in adata.layers where the smoothed expression matrix will be stored.
neighbors_key (str, optional (default: 'neighbors')) – The key in adata.uns where the information about the pre-run KNN analysis was stored. This key points to a dictionary containing the ‘connectivities_key’, ‘distances_key’, and ‘params’ from the analysis.
mode (str, optional (default: 'connectivity')) – The mode for calculating the smoothing matrix. Can be either ‘adjacency’ or ‘connectivity’.
alpha (float, optional (default: 0.33)) – The weight or fraction of the smoothed expression to use in the final expression matrix. The final expression matrix is computed as (1 - alpha) * X + alpha * (S @ X), where X is the original expression matrix and S is the smoothed matrix.
n_chunks (int, optional (default: None)) – The number of chunks to split the cells into for processing. If not provided, chunk_size is used.
chunk_size (int, optional (default: None)) – The size of each chunk of cells to process. If not provided, n_chunks is used.
use_raw (bool, optional (default: False)) – Whether to use the raw data stored in adata.raw.X.
disable_pbar (bool, optional (default: False)) – Whether to disable the progress bar.
- Returns:
The smoothed expression matrix is stored in adata.layers[key_added].
- Return type:
None
Notes
This function smooths the expression values of single cells based on their K-nearest neighbors (KNNs) using the Scanpy package. The smoothing is performed by calculating a smoothing matrix S based on the nearest neighbor graph and then computing the smoothed expression as (1 - alpha) * X + alpha * (S @ X), where X is the original expression matrix.
The smoothing is performed in chunks to reduce memory usage. The number of chunks or the chunk size can be specified using the n_chunks or chunk_size parameters, respectively.
The smoothed expression matrix is stored in adata.layers[key_added].
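The blending formula (1 - alpha) * X + alpha * (S @ X) can be sketched on nested lists as follows. How S is derived from the neighbor graph is not specified above, so the row-normalized adjacency used here is an assumption, and the helper names are hypothetical.

```python
def row_normalize(adjacency):
    """Turn a neighbor matrix into weights that sum to 1 per row (assumption)."""
    normalized = []
    for row in adjacency:
        total = sum(row)
        normalized.append([v / total if total else 0.0 for v in row])
    return normalized


def smooth(X, S, alpha=0.33):
    """Return (1 - alpha) * X + alpha * (S @ X) for nested-list matrices."""
    n_cells, n_genes = len(X), len(X[0])
    smoothed = []
    for i in range(n_cells):
        row = []
        for g in range(n_genes):
            # (S @ X)[i][g]: weighted average of gene g over the neighbors of cell i.
            neighbor_avg = sum(S[i][j] * X[j][g] for j in range(n_cells))
            row.append((1 - alpha) * X[i][g] + alpha * neighbor_avg)
        smoothed.append(row)
    return smoothed


X = [[0.0], [10.0]]           # two cells, one gene
adjacency = [[0, 1], [1, 0]]  # each cell is the other's only neighbor
S = row_normalize(adjacency)
result = smooth(X, S, alpha=0.5)
```

With alpha=0.5 each cell ends up halfway between its own value and its neighbor's value; alpha=0 would return X unchanged.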
- sccellfie.expression.get_global_mean_threshold(adata, lower_bound=1e-05, upper_bound=None, exclude_zeros=False, use_raw=False)
Obtains the global mean threshold for each gene in an AnnData object.
- Parameters:
adata (AnnData object) – Annotated data matrix.
lower_bound (float or pandas.DataFrame, optional (default: 1e-5)) – Lower bound for the threshold. If a pandas.DataFrame is provided, it must have the same number of genes as the adata object.
upper_bound (float or pandas.DataFrame, optional (default: None)) – Upper bound for the threshold. If a pandas.DataFrame is provided, it must have the same number of genes as the adata object.
exclude_zeros (bool, optional (default: False)) – Whether to exclude zeros when computing the threshold.
use_raw (bool, optional (default: False)) – Whether to use the raw data stored in adata.raw.X.
- Returns:
thresholds – A pandas.DataFrame object with the global mean threshold for each gene.
- Return type:
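The interplay between exclude_zeros, lower_bound, and upper_bound can be sketched as below. This is an illustration under assumptions rather than the library's code: "global" is taken to mean a mean over all cells, the bounds are scalars, and the input is a plain dict mapping each gene to its values (the name mean_threshold is hypothetical).

```python
def mean_threshold(expression, lower_bound=1e-5, upper_bound=None,
                   exclude_zeros=False):
    """expression: dict mapping gene -> list of values across all cells."""
    thresholds = {}
    for gene, values in expression.items():
        vals = [v for v in values if v != 0] if exclude_zeros else list(values)
        mean = sum(vals) / len(vals) if vals else 0.0
        # Clip the per-gene mean to the requested bounds.
        if lower_bound is not None:
            mean = max(mean, lower_bound)
        if upper_bound is not None:
            mean = min(mean, upper_bound)
        thresholds[gene] = mean
    return thresholds


expression = {"A": [0, 0, 4, 8], "B": [0, 0, 0, 0]}
default = mean_threshold(expression)
nonzero = mean_threshold(expression, exclude_zeros=True)
capped = mean_threshold(expression, exclude_zeros=True, upper_bound=5.0)
```

The lower bound keeps a never-expressed gene such as B from receiving a threshold of exactly zero.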
- sccellfie.expression.get_global_trimean_threshold(adata, lower_bound=1e-05, upper_bound=None, exclude_zeros=False, use_raw=False)
Obtains the global Tukey’s trimean threshold for each gene in an AnnData object.
- Parameters:
adata (AnnData object) – Annotated data matrix.
lower_bound (float or pandas.DataFrame, optional (default: 1e-5)) – Lower bound for the threshold. If a pandas.DataFrame is provided, it must have the same number of genes as the adata object.
upper_bound (float or pandas.DataFrame, optional (default: None)) – Upper bound for the threshold. If a pandas.DataFrame is provided, it must have the same number of genes as the adata object.
exclude_zeros (bool, optional (default: False)) – Whether to exclude zeros when computing the threshold.
use_raw (bool, optional (default: False)) – Whether to use the raw data stored in adata.raw.X.
- Returns:
thresholds – A pandas.DataFrame object with the global Tukey’s trimean threshold for each gene.
- Return type:
- sccellfie.expression.get_local_mean_threshold(adata, lower_bound=1e-05, upper_bound=None, exclude_zeros=False, use_raw=False)
Obtains the local mean threshold for each gene in an AnnData object.
- Parameters:
adata (AnnData object) – Annotated data matrix.
lower_bound (float or pandas.DataFrame, optional (default: 1e-5)) – Lower bound for the threshold. If a pandas.DataFrame is provided, it must have the same number of genes as the adata object.
upper_bound (float or pandas.DataFrame, optional (default: None)) – Upper bound for the threshold. If a pandas.DataFrame is provided, it must have the same number of genes as the adata object.
exclude_zeros (bool, optional (default: False)) – Whether to exclude zeros when computing the threshold.
use_raw (bool, optional (default: False)) – Whether to use the raw data stored in adata.raw.X.
- Returns:
thresholds – A pandas.DataFrame object with the local mean threshold for each gene.
- Return type:
- sccellfie.expression.get_global_percentile_threshold(adata, percentile=0.75, lower_bound=1e-05, upper_bound=None, exclude_zeros=False, use_raw=False)
Obtains the global percentile threshold for each gene in an AnnData object.
- Parameters:
adata (AnnData object) – Annotated data matrix.
percentile (float or list of floats, optional (default: 0.75)) – Percentile(s) to compute the threshold.
lower_bound (float or pandas.DataFrame, optional (default: 1e-5)) – Lower bound for the threshold. If a pandas.DataFrame is provided, it must have the same number of genes as the adata object.
upper_bound (float or pandas.DataFrame, optional (default: None)) – Upper bound for the threshold. If a pandas.DataFrame is provided, it must have the same number of genes as the adata object.
exclude_zeros (bool, optional (default: False)) – Whether to exclude zeros when computing the threshold.
use_raw (bool, optional (default: False)) – Whether to use the raw data stored in adata.raw.X.
- Returns:
thresholds – A pandas.DataFrame object with the global percentile threshold for each gene.
- Return type:
- sccellfie.expression.get_local_percentile_threshold(adata, percentile=0.75, lower_bound=1e-05, upper_bound=None, exclude_zeros=False, use_raw=False)
Obtains the local percentile threshold for each gene in an AnnData object.
- Parameters:
adata (AnnData object) – Annotated data matrix.
percentile (float or list of floats, optional (default: 0.75)) – Percentile(s) to compute the threshold.
lower_bound (float or pandas.DataFrame, optional (default: 1e-5)) – Lower bound for the threshold. If a pandas.DataFrame is provided, it must have the same number of genes as the adata object.
upper_bound (float or pandas.DataFrame, optional (default: None)) – Upper bound for the threshold. If a pandas.DataFrame is provided, it must have the same number of genes as the adata object.
exclude_zeros (bool, optional (default: False)) – Whether to exclude zeros when computing the threshold.
use_raw (bool, optional (default: False)) – Whether to use the raw data stored in adata.raw.X.
- Returns:
thresholds – A pandas.DataFrame object with the local percentile threshold for each gene.
- Return type:
- sccellfie.expression.get_local_trimean_threshold(adata, lower_bound=1e-05, upper_bound=None, exclude_zeros=False, use_raw=False)
Obtains the local Tukey’s trimean threshold for each gene in an AnnData object.
- Parameters:
adata (AnnData object) – Annotated data matrix.
lower_bound (float or pandas.DataFrame, optional (default: 1e-5)) – Lower bound for the threshold. If a pandas.DataFrame is provided, it must have the same number of genes as the adata object.
upper_bound (float or pandas.DataFrame, optional (default: None)) – Upper bound for the threshold. If a pandas.DataFrame is provided, it must have the same number of genes as the adata object.
exclude_zeros (bool, optional (default: False)) – Whether to exclude zeros when computing the threshold.
use_raw (bool, optional (default: False)) – Whether to use the raw data stored in adata.raw.X.
- Returns:
thresholds – A pandas.DataFrame object with the local Tukey’s trimean threshold for each gene.
- Return type:
- sccellfie.expression.get_sccellfie_dataset_threshold(adata, gene_set=None, organism='human', cell_mask=None, layer=None, use_raw=False, target_sum=10000, n_counts_key=None, chunk_size=100000, reservoir_size=5000000, percentiles=(10, 25, 50, 75, 90, 95), lower_percentile=25, upper_percentile=75, random_state=None, verbose=True, return_stats=False)
Computes a dataset-wise sccellfie_threshold per metabolic gene by streaming the AnnData in chunks. Faithful port of the atlas-based threshold script that produced the default Thresholds.csv, generalized to a single (possibly backed) AnnData.
- Pipeline per chunk:
CP10k-normalize using a per-cell library size (obs column or computed from the full chunk).
Subset to the corrected metabolic-gene columns (after applying CORRECT_GENES[organism]).
Accumulate per-gene sum, non-zero cell count, and max.
Stream non-zero normalized values into a reservoir sample for global percentiles.
- The final threshold rule matches the original script (with configurable bounds):
if max > P_lower or max == 0:
    threshold = clip(nonzero_mean, P_lower, P_upper)
else:
    threshold = nonzero_mean
where P_lower/P_upper default to P25 / P75 (the original atlas behavior) and are controlled by lower_percentile/upper_percentile.
- Parameters:
adata (AnnData) – Annotated data matrix. May be backed (sc.read_h5ad(..., backed='r')); chunks are materialized one at a time.
gene_set (list, set, pandas.Index, str or None, optional (default: None)) – Metabolic gene list. None loads the default gene list from the scCellFie database for organism. A string ending in .json is treated as the path to a JSON file containing a list of gene symbols.
organism (str, optional (default: 'human')) – Used to select the CORRECT_GENES rename map and, if gene_set is None, the scCellFie database to load metabolic genes from. Currently 'human' or 'mouse'.
cell_mask (array-like, str or None, optional (default: None)) – Restricts the computation to a subset of cells. Accepts a boolean/integer array, a column name in adata.obs, or a pandas.Series indexed by cell names.
layer (str or None, optional (default: None)) – Read from adata.layers[layer] instead of adata.X. Mutually exclusive with use_raw.
use_raw (bool, optional (default: False)) – Read from adata.raw.X. Mutually exclusive with layer.
target_sum (float or None, optional (default: 10_000)) – Target library size for CP-normalization. Pass None to skip normalization (e.g. when the input values are already on the desired scale).
n_counts_key (str or None, optional (default: None)) – Column in adata.obs containing per-cell totals. If None, auto-detect among ('total_counts', 'n_counts', 'raw_sum', 'nCount_RNA') and otherwise compute per-cell sums from the full-matrix chunk before gene subsetting.
chunk_size (int, optional (default: 100_000)) – Number of cells processed per chunk.
reservoir_size (int, optional (default: 5_000_000)) – Size of the reservoir used to estimate global percentiles of non-zero normalized values. Memory cost is reservoir_size * 4B (float32).
percentiles (tuple of int, optional (default: (10, 25, 50, 75, 90, 95))) – Percentiles to report in the returned stats. Always merged with {lower_percentile, upper_percentile} so the rule’s bounds are also available for inspection.
lower_percentile, upper_percentile (int or float, optional (default: 25 and 75)) – Percentile bounds used by the clip rule. The threshold for each gene is clip(nonzero_mean, P_lower, P_upper) when the gene’s max value exceeds P_lower or is zero (the low-expression escape); otherwise the raw nonzero_mean is used. Must satisfy 0 <= lower_percentile < upper_percentile <= 100. Defaults reproduce the original atlas-derived sccellfie_threshold exactly.
random_state (int or None, optional (default: None)) – Seed for the reservoir sampler.
verbose (bool, optional (default: True)) – If True, print progress via tqdm.
return_stats (bool, optional (default: False)) – If True, also return a dict with intermediate statistics.
- Returns:
thresholds (pandas.DataFrame) – A DataFrame indexed by metabolic gene symbol with a single column 'sccellfie_threshold'. Ready to pass to compute_gene_scores (which selects the first column positionally).
stats (dict, only if return_stats=True) – Dict with keys percentiles, sum_per_gene, nnz_per_gene, max_per_gene, mean, nonzero_mean, n_cells, n_values_seen, reservoir_size_used.
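The per-cell normalization and the final threshold rule can be written out as a small standard-library sketch. It covers only those two steps, not the chunked streaming or the reservoir sampling, and the function names cp_normalize and threshold_rule are hypothetical.

```python
def cp_normalize(counts, library_size, target_sum=10_000):
    """Scale one cell's counts so its library size becomes `target_sum`."""
    if target_sum is None or library_size == 0:
        return list(counts)  # normalization skipped
    return [c * target_sum / library_size for c in counts]


def threshold_rule(nonzero_mean, max_value, p_lower, p_upper):
    """Clip rule from the docstring, with P_lower/P_upper passed in directly."""
    if max_value > p_lower or max_value == 0:
        # clip(nonzero_mean, P_lower, P_upper)
        return min(max(nonzero_mean, p_lower), p_upper)
    # Low-expression escape: the gene never exceeds P_lower.
    return nonzero_mean


normalized = cp_normalize([1, 3], library_size=4)
raised = threshold_rule(0.5, 3.0, p_lower=1.0, p_upper=2.0)   # below P_lower
capped = threshold_rule(5.0, 9.0, p_lower=1.0, p_upper=2.0)   # above P_upper
escaped = threshold_rule(0.4, 0.8, p_lower=1.0, p_upper=2.0)  # max <= P_lower
```

In the real function P_lower and P_upper are global percentiles estimated from the reservoir sample; here they are supplied as plain numbers to keep the rule itself visible.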
- sccellfie.expression.set_manual_threshold(adata, threshold)
Sets a threshold manually for each gene in an AnnData object.
- Parameters:
- Returns:
thresholds – A pandas.DataFrame object with the manual threshold for each gene.
- Return type:
External
- sccellfie.external.sccellfie_to_tensor(preprocessed_db, sample_key, celltype_key, score_type='metabolic_tasks', min_cells_per_group=1, agg_func='trimean', layer=None, gene_symbols=None, top_percent=10, exclude_zeros=False, use_raw=False, threshold=None, order_labels=None, sort_elements=True, context_order=None, fill_value=nan, verbose=True)
Converts scCellFie scores to format compatible with cell2cell’s PreBuiltTensor constructor.
This function builds a 3D tensor with dimensions: [Contexts/Samples, Cell Types, Metabolic Features]
- Parameters:
preprocessed_db (dict) – Output from run_sccellfie_pipeline containing ‘adata’ with metabolic_tasks and/or reactions attributes.
sample_key (str) – Column name in adata.obs for grouping by samples/contexts.
celltype_key (str) – Column name in adata.obs for cell type annotations.
score_type (str, optional (default: 'metabolic_tasks')) – Which scCellFie scores to use. Options: ‘metabolic_tasks’, ‘reactions’.
min_cells_per_group (int, optional (default: 1)) – Minimum number of cells required per group (sample x celltype) to be included in analysis.
agg_func (str, optional (default: 'trimean')) – Aggregation function to apply within cell groups. Options: ‘mean’, ‘median’, ‘25p’, ‘75p’, ‘trimean’, ‘topmean’, ‘fraction_above’.
layer (str, optional (default: None)) – Layer name to use for aggregation. If None, uses the main .X matrix.
gene_symbols (str or list, optional (default: None)) – Specific features to include in analysis. If None, all features are used.
top_percent (float, optional (default: 10)) – Percentage of top values for ‘topmean’ aggregation (0-100).
exclude_zeros (bool, optional (default: False)) – Whether to exclude zeros when aggregating values.
use_raw (bool, optional (default: False)) – Whether to use raw data for aggregation.
threshold (float, optional (default: None)) – Expression threshold for ‘fraction_above’ aggregation.
order_labels (list, optional (default: None)) – Labels for each dimension of the tensor. Default: [‘Contexts’, ‘Cell Types’, ‘Metabolic Features’]
sort_elements (bool, optional (default: True)) – Whether to alphabetically sort elements in each dimension.
context_order (list, optional (default: None)) – Custom order for contexts. If provided, contexts won’t be sorted.
fill_value (float, optional (default: numpy.nan)) – Value to fill when a feature or cell type is missing in a context.
verbose (bool, optional (default: True)) – Whether to print information about the analysis.
- Returns:
prebuilt_tensor_args – A dictionary containing all arguments needed for the PreBuiltTensor constructor:
‘tensor’: numpy array with shape (n_contexts, n_celltypes, n_features)
‘order_names’: list of lists with names for each dimension
‘order_labels’: list of dimension labels
‘mask’: mask for missing values (if applicable)
‘loc_nans’: locations of NaN values
- Return type:
Notes
This function aggregates single-cell metabolic scores into cell type-level summaries across different contexts (samples, conditions, timepoints, etc.) and creates a tensor suitable for tensor decomposition analysis.
The aggregation is performed using scCellFie’s robust aggregation methods, which handle various statistical measures and can exclude zeros or use specific thresholds.
Examples
>>> # Convert scCellFie metabolic tasks to tensor format
>>> tensor_args = sccellfie_to_tensor(
...     preprocessed_db,
...     sample_key='condition',
...     celltype_key='cell_type',
...     score_type='metabolic_tasks',
...     agg_func='mean'
... )
>>>
>>> # Create PreBuiltTensor
>>> from cell2cell.tensor import PreBuiltTensor
>>> tensor = PreBuiltTensor(**tensor_args)
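To make the [Contexts/Samples, Cell Types, Metabolic Features] layout and the role of fill_value concrete, the following sketch assembles a nested-list tensor from already aggregated scores. It is a simplified stand-in, not the function's implementation: the build_tensor helper and its dict input are hypothetical, and the real function returns a numpy array.

```python
import math


def build_tensor(scores, fill_value=float("nan")):
    """scores: dict mapping (context, celltype) -> {feature: aggregated value}."""
    # Sort the elements of every dimension alphabetically.
    contexts = sorted({context for context, _ in scores})
    celltypes = sorted({celltype for _, celltype in scores})
    features = sorted({f for values in scores.values() for f in values})
    # Missing (context, celltype, feature) combinations get `fill_value`.
    tensor = [[[scores.get((c, ct), {}).get(f, fill_value) for f in features]
               for ct in celltypes]
              for c in contexts]
    return {
        "tensor": tensor,
        "order_names": [contexts, celltypes, features],
        "order_labels": ["Contexts", "Cell Types", "Metabolic Features"],
    }


scores = {
    ("ctrl", "B"): {"task1": 1.0},
    ("ctrl", "T"): {"task1": 2.0},
    ("treated", "T"): {"task1": 3.0},  # no B cells in the treated sample
}
args = build_tensor(scores)
```

The treated sample has no B cells, so that entry of the tensor is NaN rather than zero.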
- sccellfie.external.quick_markers(adata, cluster_key, cell_groups=None, layer=None, n_markers=10, fdr=0.01, express_cut=0.9, r_output=False)
Identifies top N markers for each cluster in an AnnData object using a TF-IDF-based strategy. Implemented as in the SoupX library for R.
- Parameters:
adata (AnnData) – Annotated data matrix from Scanpy.
cluster_key (str) – Key in adata.obs for the cluster labels.
cell_groups (list, optional (default: None)) – List of cell groups to be compared in the analysis.
layer (str, optional (default: None)) – Layer to use for the analysis. If None, uses adata.X.
n_markers (int, optional (default: 10)) – Number of marker genes to return per cluster.
fdr (float, optional (default: 0.01)) – False discovery rate for the hypergeometric test.
express_cut (float, optional (default: 0.9)) – Value above which a gene is considered expressed.
r_output (bool, optional (default: False)) – Whether to report the exact same column names as the SoupX version.
- Returns:
markers – A pandas.DataFrame with top N markers for each cluster and their statistics.
- Return type:
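A TF-IDF marker score can be sketched as follows. The formulation is an assumption based on the common TF-IDF pattern (tf is the fraction of cells in a cluster expressing the gene, idf is the log of total cells over cells expressing the gene anywhere); the hypergeometric test and FDR filtering are omitted, and the tfidf_markers helper is hypothetical.

```python
import math


def tfidf_markers(values, clusters, express_cut=0.9):
    """values: dict gene -> per-cell expression; clusters: per-cell labels."""
    n_cells = len(clusters)
    labels = sorted(set(clusters))
    rows = []
    for gene, expr in values.items():
        # A gene counts as expressed in a cell when it exceeds `express_cut`.
        expressed = [v > express_cut for v in expr]
        n_expressing = sum(expressed)
        if n_expressing == 0:
            continue  # never expressed, so it cannot be a marker
        idf = math.log(n_cells / n_expressing)
        for label in labels:
            in_cluster = [e for e, c in zip(expressed, clusters) if c == label]
            tf = sum(in_cluster) / len(in_cluster)
            rows.append({"gene": gene, "cluster": label,
                         "tf": tf, "idf": idf, "tf_idf": tf * idf})
    return rows


clusters = ["a", "a", "b", "b"]
values = {"G1": [2.0, 3.0, 0.0, 0.5],   # expressed only in cluster a
          "G2": [1.0, 2.0, 1.5, 4.0]}   # expressed everywhere
rows = tfidf_markers(values, clusters)
```

A gene expressed in every cell has idf = log(1) = 0, so it scores zero in every cluster regardless of its tf.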
- sccellfie.external.filter_tfidf_markers(df, tf_col='tf', idf_col='idf', tfidf_threshold=None, tfidf_col='tf_idf', tf_ratio=None, second_best_tf_col='second_best_tf', group_col='cluster', second_best_group_col='second_best_cluster')
Filters the top N markers for each cluster based on a hyperbolic curve fit to the TF-IDF values. Additional filtering can be applied based on the TF-IDF threshold and the ratio of the TF score to the second-best TF score.
- Parameters:
df (pandas.DataFrame) – DataFrame containing the marker data. See sccellfie.external.quick_markers for details.
tf_col (str, optional (default: 'tf')) – Column name for the Term Frequency (TF) values.
idf_col (str, optional (default: 'idf')) – Column name for the Inverse Document Frequency (IDF) values.
tfidf_threshold (float, optional (default: None)) – Threshold for the TF-IDF values. If provided, only markers with TF-IDF values above this threshold are kept. A value of 0.3 is recommended for most datasets.
tfidf_col (str, optional (default: 'tf_idf')) – Column name for the TF-IDF values. Used for filtering based on the TF-IDF threshold.
tf_ratio (float, optional (default: None)) – Threshold for the ratio of the TF score to the second-best TF score. If provided, only markers with a ratio above this threshold are kept. A value of 1.2 is recommended for most datasets.
second_best_tf_col (str, optional (default: 'second_best_tf')) – Column name for the second-best TF values. Used for filtering based on the TF ratio.
group_col (str, optional (default: 'cluster')) – Column name for the cluster labels. Used for filtering based on the TF ratio. This is to keep markers when the cluster equals the second-best cluster (very specific marker).
second_best_group_col (str, optional (default: 'second_best_cluster')) – Column name for the second-best cluster labels. Used for filtering based on the TF ratio. This is to keep markers when the cluster equals the second-best cluster (very specific marker).
- Returns:
filtered_df (pandas.DataFrame) – DataFrame containing the filtered markers.
theoretical_curve (tuple) – Tuple containing the x and y values of the theoretical hyperbolic curve.
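The two optional filters (tfidf_threshold and tf_ratio) can be illustrated on a list of row dicts. This sketch does not reproduce the hyperbolic curve fit, and its handling of a zero second-best TF (treated as passing) is an assumption; the filter_markers helper is hypothetical.

```python
def filter_markers(rows, tfidf_threshold=None, tf_ratio=None):
    """Keep rows passing the optional TF-IDF and TF-ratio filters."""
    kept = []
    for row in rows:
        if tfidf_threshold is not None and row["tf_idf"] <= tfidf_threshold:
            continue
        if tf_ratio is not None:
            # A marker whose second-best cluster is its own cluster is kept.
            same_group = row["cluster"] == row["second_best_cluster"]
            second = row["second_best_tf"]
            # Assumption: a zero second-best TF counts as passing the ratio.
            ratio_ok = second == 0 or row["tf"] / second > tf_ratio
            if not (ratio_ok or same_group):
                continue
        kept.append(row)
    return kept


rows = [
    {"gene": "G1", "cluster": "a", "tf": 0.9, "tf_idf": 0.9,
     "second_best_tf": 0.3, "second_best_cluster": "b"},
    {"gene": "G2", "cluster": "a", "tf": 0.5, "tf_idf": 0.2,
     "second_best_tf": 0.1, "second_best_cluster": "b"},
    {"gene": "G3", "cluster": "a", "tf": 0.6, "tf_idf": 0.6,
     "second_best_tf": 0.55, "second_best_cluster": "b"},
    {"gene": "G4", "cluster": "a", "tf": 0.6, "tf_idf": 0.6,
     "second_best_tf": 0.55, "second_best_cluster": "a"},
]
kept_genes = [r["gene"] for r in filter_markers(rows, tfidf_threshold=0.3,
                                                tf_ratio=1.2)]
```

G2 fails the TF-IDF threshold, G3 fails the ratio, and G4 survives the ratio filter only because its second-best cluster is the same cluster.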
- sccellfie.external.markers_to_dict(markers_df, n_markers=10, sort_by='tf_idf', cluster_col='cluster', gene_col='gene', ascending=False)
Converts a markers DataFrame to a dictionary mapping cluster names to lists of marker genes.
- Parameters:
markers_df (pandas.DataFrame) – DataFrame containing marker data with cluster and gene information.
n_markers (int, optional (default: 10)) – Number of top markers to select per cluster.
sort_by (str, optional (default: 'tf_idf')) – Column name to sort markers by for each cluster.
cluster_col (str, optional (default: 'cluster')) – Column name containing cluster labels.
gene_col (str, optional (default: 'gene')) – Column name containing gene names.
ascending (bool, optional (default: False)) – Whether to sort in ascending order. Default is False (descending order for TF-IDF).
- Returns:
markers_dict – Dictionary mapping cluster names to lists of marker gene names. Keys are naturally sorted cluster names.
- Return type:
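The conversion can be sketched on a list of row dicts instead of a DataFrame. The helper names are hypothetical, and the natural-sort key shown here is one common way to order labels such as '2' before '10'; the library may implement it differently.

```python
import re


def natural_key(label):
    """Sort key that orders embedded numbers numerically ('2' before '10')."""
    # re.split with a capture group alternates text and digit runs, so the
    # element types line up position by position when keys are compared.
    return [int(part) if part.isdigit() else part
            for part in re.split(r"(\d+)", str(label))]


def markers_to_dict_sketch(rows, n_markers=10, sort_by="tf_idf",
                           ascending=False):
    grouped = {}
    for row in rows:
        grouped.setdefault(row["cluster"], []).append(row)
    result = {}
    for cluster in sorted(grouped, key=natural_key):
        ranked = sorted(grouped[cluster], key=lambda r: r[sort_by],
                        reverse=not ascending)
        result[cluster] = [r["gene"] for r in ranked[:n_markers]]
    return result


rows = [
    {"cluster": "10", "gene": "A", "tf_idf": 0.2},
    {"cluster": "10", "gene": "B", "tf_idf": 0.8},
    {"cluster": "2", "gene": "C", "tf_idf": 0.5},
]
markers = markers_to_dict_sketch(rows, n_markers=1)
```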
IO
- sccellfie.io.load_adata(folder, filename, reactions_filename=None, metabolic_tasks_filename=None, spatial_network_key='spatial_network', verbose=True)
Loads an AnnData object and its scCellFie attributes from a folder.
- Parameters:
folder (str) – The folder from which to load the AnnData object.
filename (str) – The name of the file from which to load the AnnData object.
reactions_filename (str, optional (default: None)) – The name of the file (without extension) to load the reactions object. If None, the default name is filename_reactions.
metabolic_tasks_filename (str, optional (default: None)) – The name of the file (without extension) to load the metabolic_tasks object. If None, the default name is filename_metabolic_tasks.
spatial_network_key (str, optional (default: 'spatial_network')) – The key in adata.uns or a scCellFie_attribute.uns where the spatial knn graph is stored, if it exists.
verbose (bool, optional (default: True)) – Whether to print the file names that were loaded.
- Returns:
adata – Annotated data matrix. If scCellFie attributes are found, they are also loaded into adata.reactions and adata.metabolic_tasks.
- Return type:
AnnData object
- sccellfie.io.save_adata(adata, output_directory, filename, spatial_network_key='spatial_network', verbose=True)[source]
Saves an AnnData object and its scCellFie attributes to a folder.
- Parameters:
adata (AnnData object) – Annotated data matrix.
output_directory (str) – Directory to save the results (AnnData objects).
filename (str) – The name of the file to save the AnnData object. Do not include the file extension.
spatial_network_key (str, optional (default: 'spatial_network')) – The key in adata.uns or a scCellFie_attribute.uns where the spatial knn graph is stored.
verbose (bool, optional (default: True)) – Whether to print the file names that were saved.
- Returns:
The AnnData object is saved to output_directory/filename.h5ad. The scCellFie attributes are saved to:
reactions: output_directory/filename_reactions.h5ad.
metabolic_tasks: output_directory/filename_metabolic_tasks.h5ad.
- Return type:
None
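The file layout shared by save_adata and load_adata can be summarized with a small path helper. This is a sketch of the naming convention documented above, not library code: expected_paths and the example directory and file names are hypothetical.

```python
from pathlib import Path


def expected_paths(output_directory, filename):
    """Hypothetical helper: the three files documented for save_adata."""
    base = Path(output_directory)
    return {
        "adata": base / f"{filename}.h5ad",
        "reactions": base / f"{filename}_reactions.h5ad",
        "metabolic_tasks": base / f"{filename}_metabolic_tasks.h5ad",
    }


# Example names are made up; filename is given without the extension.
paths = expected_paths("results", "sample1")
for label, path in paths.items():
    print(label, path)
```

This is why filename is passed without an extension: the suffixes _reactions and _metabolic_tasks are appended to the bare name before .h5ad.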
- sccellfie.io.save_result_summary(results_dict, output_directory, prefix='')[source]
Save the result summary contained in a dictionary to CSV files.
- sccellfie.io.load_segmentation(filepath: str, cell_ids: ndarray | None = None, cell_id_col: str | None = None, vertex_x_col: str | None = None, vertex_y_col: str | None = None, output: str = 'geodataframe') gpd.GeoDataFrame | dict[source]
Load cell boundary polygons from a segmentation file.
Generic loader for any vertex-table format (one row per polygon vertex). Supports Xenium parquet, CSV.gz, CSV, TSV, and TSV.gz with auto-detection of column names.
- Parameters:
filepath (str) – Path to the cell boundaries file. Accepted extensions are .parquet, .csv.gz, .csv, .tsv, and .tsv.gz.
cell_ids (np.ndarray, optional (default: None)) – If provided, only load polygons for these cell IDs.
cell_id_col (str, optional (default: None)) – Column name for cell identifiers. Auto-detected if None. Tries "cell_id", "ID", "id", "cell_ID" in that order.
vertex_x_col (str, optional (default: None)) – Column name for vertex x-coordinates. Auto-detected if None. Tries "vertex_x", "x_location", "X".
vertex_y_col (str, optional (default: None)) – Column name for vertex y-coordinates. Auto-detected if None. Tries "vertex_y", "y_location", "Y".
output ({"geodataframe", "dict"}, optional (default: "geodataframe")) – Return format. "geodataframe" returns a GeoDataFrame indexed by cell ID with centroid_x / centroid_y columns. "dict" returns a mapping of cell_id -> shapely.Polygon.
- Returns:
Cell boundary polygons in the requested format.
- Return type:
geopandas.GeoDataFrame or dict
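The column auto-detection described for load_segmentation amounts to picking the first candidate name present in the file header. The sketch below illustrates that lookup under assumptions: detect_column is a hypothetical helper, and raising KeyError when nothing matches is a guess about error handling, not documented behavior.

```python
CELL_ID_CANDIDATES = ("cell_id", "ID", "id", "cell_ID")
VERTEX_X_CANDIDATES = ("vertex_x", "x_location", "X")
VERTEX_Y_CANDIDATES = ("vertex_y", "y_location", "Y")


def detect_column(columns, candidates, explicit=None):
    """Hypothetical helper: return `explicit` if given, else the first candidate found."""
    if explicit is not None:
        if explicit not in columns:
            raise KeyError(f"Column {explicit!r} not found")
        return explicit
    for name in candidates:  # candidates are tried in the documented order
        if name in columns:
            return name
    # Assumption: the real loader may report a missing column differently.
    raise KeyError(f"None of {candidates} found in {list(columns)}")


# Header of a made-up vertex table (one row per polygon vertex).
header = ["ID", "x_location", "y_location"]
detected = (
    detect_column(header, CELL_ID_CANDIDATES),
    detect_column(header, VERTEX_X_CANDIDATES),
    detect_column(header, VERTEX_Y_CANDIDATES),
)
```

Passing cell_id_col, vertex_x_col, or vertex_y_col explicitly corresponds to the explicit argument here and bypasses the candidate search.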
- sccellfie.io.load_xenium_segmentation(filepath: str, cell_ids: ndarray | None = None, cell_id_col: str | None = None, vertex_x_col: str | None = None, vertex_y_col: str | None = None, output: str = 'geodataframe') gpd.GeoDataFrame | dict[source]
Load cell boundaries from a Xenium cell_boundaries file.
Thin wrapper around load_segmentation() kept for discoverability. Xenium cell_boundaries files use the default auto-detected columns (cell_id, vertex_x, vertex_y) so this is equivalent to calling load_segmentation() directly.
See load_segmentation() for parameter and return documentation.
- sccellfie.io.load_segmentation_from_gdf(gdf, geometry_col: str = 'geometry')[source]
Prepare a pre-loaded GeoDataFrame for downstream plotting.
Adds centroid_x and centroid_y columns if missing.
- Parameters:
gdf (geopandas.GeoDataFrame) – GeoDataFrame with polygon geometries.
geometry_col (str, optional (default: "geometry")) – Name of the geometry column.
- Returns:
Input GeoDataFrame with centroid columns added.
- Return type:
geopandas.GeoDataFrame
- sccellfie.io.read_xenium(data_dir: str | Path, slide_id: str | None = None, segmentation: str = 'cell', cluster_file: str | Path | bool | None = None, spatial_key: str = 'X_spatial', verbose: bool = True) AnnData[source]
Read a 10x Xenium output bundle into an AnnData.
- Parameters:
data_dir (str or Path) – Path to the Xenium bundle root, or to a directory containing one sub-directory per slide (in which case slide_id selects the slide).
slide_id (str, optional (default: None)) – Sub-directory under data_dir. When None, data_dir itself is treated as the bundle root.
segmentation ({"cell", "nucleus"}, optional (default: "cell")) – "cell" reads cell_feature_matrix.h5 and joins centroids from cells.csv.gz. "nucleus" reads nucleus_feature_matrix.h5ad and pulls centroids from its obs columns x_centroid / y_centroid.
cluster_file (str, Path, False, or None, optional (default: None)) – Path to a cluster-assignment CSV (with columns Barcode, Cluster). When None, analysis/clustering/gene_expression_graphclust/clusters.csv is auto-loaded if present. Pass False to skip the lookup.
spatial_key (str, optional (default: "X_spatial")) – Key under which centroids are stored in adata.obsm. Defaults to scCellFie’s canonical key; pass "spatial" if you also want scanpy.pl.spatial to find them.
verbose (bool, optional (default: True)) – Print informational messages.
- Returns:
AnnData with centroid coordinates in adata.obsm[spatial_key] and any cluster assignments in adata.obs['cluster'].
- Return type:
anndata.AnnData
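How data_dir, slide_id, and cluster_file interact in read_xenium can be illustrated with two small path helpers. This is a sketch of the documented rules only: resolve_bundle_root and resolve_cluster_file are hypothetical names, and the temporary directory stands in for a real Xenium bundle.

```python
import tempfile
from pathlib import Path

DEFAULT_CLUSTER_FILE = Path(
    "analysis/clustering/gene_expression_graphclust/clusters.csv"
)


def resolve_bundle_root(data_dir, slide_id=None):
    """Hypothetical helper: data_dir is the root unless slide_id picks a sub-directory."""
    root = Path(data_dir)
    return root if slide_id is None else root / slide_id


def resolve_cluster_file(bundle_root, cluster_file=None):
    """Hypothetical helper: False skips the lookup, None tries the default location."""
    if cluster_file is False:
        return None
    if cluster_file is not None:
        return Path(cluster_file)
    candidate = Path(bundle_root) / DEFAULT_CLUSTER_FILE
    return candidate if candidate.exists() else None


with tempfile.TemporaryDirectory() as tmp:
    bundle = resolve_bundle_root(tmp, slide_id="slide_A")  # made-up slide name
    found_before = resolve_cluster_file(bundle)
    # Create the default cluster file so that the auto-lookup succeeds.
    default_file = bundle / DEFAULT_CLUSTER_FILE
    default_file.parent.mkdir(parents=True)
    default_file.touch()
    found_after = resolve_cluster_file(bundle)
    skipped = resolve_cluster_file(bundle, cluster_file=False)
    bundle_name = bundle.name
```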
- sccellfie.io.read_visium(path: str | Path, *, count_file: str = 'filtered_feature_bc_matrix.h5', library_id: str | None = None, source_image_path: str | Path | None = None, is_hd: bool = False, hd_layout: str = 'detect', genome: str | None = None, load_images: bool = True) AnnData[source]
Read a 10x Visium / VisiumHD bundle into an AnnData.
Standard Visium and VisiumHD-bins layouts delegate to scanpy.read_visium(). The VisiumHD-segmented layout (presence of cell_segmentations.geojson) is handled by a custom branch that derives centroids and per-cell areas from the polygons, optionally merges nucleus areas from nucleus_segmentations.geojson, and writes coordinates under both obsm['spatial'] and obsm['X_spatial'].
- Parameters:
path (str or Path) – Path to the Visium bundle directory.
count_file (str, optional (default: "filtered_feature_bc_matrix.h5")) – Filename of the count matrix inside path.
library_id (str, optional (default: None)) – Identifier used as the key under adata.uns['spatial']. When None it is read from the count file’s HDF5 attributes.
source_image_path (str or Path, optional (default: None)) – Path to the high-resolution tissue image, recorded under adata.uns['spatial'][library_id]['metadata']['source_image_path'].
is_hd (bool, optional (default: False)) – Whether this is a VisiumHD bundle. Used together with hd_layout to dispatch to the right branch.
hd_layout ({"detect", "bins", "segmented", "standard"}, optional (default: "detect")) – Force a specific HD layout. "detect" (default) auto-detects: "segmented" if cell_segmentations.geojson is present, otherwise "bins" if spatial/tissue_positions.parquet is present, otherwise "standard".
genome (str, optional (default: None)) – Filter expression to genes within this genome (passed through to scanpy.read_visium()).
load_images (bool, optional (default: True)) – Whether to load hires/lowres tissue images.
- Returns:
AnnData with spatial information stored in standard scanpy format. For the segmented branch, also exposes adata.obsm['X_spatial'] (scCellFie convention).
- Return type:
anndata.AnnData
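The "detect" rule for hd_layout is a short cascade of file checks. The sketch below mirrors the documented order; detect_hd_layout is a hypothetical function, and placing cell_segmentations.geojson directly under the bundle directory is an assumption about where the file lives.

```python
import tempfile
from pathlib import Path


def detect_hd_layout(bundle_dir):
    """Hypothetical helper mirroring the documented "detect" order."""
    bundle = Path(bundle_dir)
    # Assumption: the segmentation file sits at the top level of the bundle.
    if (bundle / "cell_segmentations.geojson").exists():
        return "segmented"
    if (bundle / "spatial" / "tissue_positions.parquet").exists():
        return "bins"
    return "standard"


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    empty_layout = detect_hd_layout(root)

    # Adding the bins file switches the result to "bins".
    (root / "spatial").mkdir()
    (root / "spatial" / "tissue_positions.parquet").touch()
    bins_layout = detect_hd_layout(root)

    # The segmentation file takes precedence over the bins file.
    (root / "cell_segmentations.geojson").touch()
    segmented_layout = detect_hd_layout(root)
```

Setting hd_layout to "bins", "segmented", or "standard" bypasses this cascade and forces the corresponding branch.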
Plotting
- sccellfie.plotting.plot_communication_network(ccc_scores, sender_col, receiver_col, score_col, score_threshold=None, panel_size=(12, 8), network_layout='spring', edge_color='magenta', edge_width=25, edge_arrow_size=20, edge_alpha=0.25, node_color='#210070', node_size=1000, node_alpha=0.9, node_label_size=12, node_label_alpha=0.7, node_label_offset=(0.05, -0.2), title=None, title_fontsize=14, ax=None, save=None, dpi=300, tight_layout=True, bbox_inches='tight')[source]
Plots a network of cell-cell communication. Edges represent communication scores between cells. These scores could be an overall communication score or a specific ligand-receptor pair score.
- Parameters:
ccc_scores (pandas.DataFrame) – DataFrame containing the cell-cell communication scores. It should contain columns for the sender cell, receiver cell, and the communication score.
sender_col (str) – Column name for the sender cell.
receiver_col (str) – Column name for the receiver cell.
score_col (str) – Column name for the communication score.
score_threshold (float, optional (default: None)) – Threshold for the communication score. If provided, only scores above this threshold are plotted.
panel_size (tuple, optional (default: (12, 8))) – Size of the plot panel. Only works if ax is None.
network_layout (str, optional (default: 'spring')) – Layout of the network graph. Should be either ‘spring’ or ‘circular’.
edge_color (str, optional (default: 'magenta')) – Color of the edges.
edge_width (float, optional (default: 25)) – Width of the edges.
edge_arrow_size (float, optional (default: 20)) – Size of the edge arrows.
edge_alpha (float, optional (default: 0.25)) – Transparency of the edges.
node_color (str, optional (default: '#210070')) – Color of the nodes.
node_size (int, optional (default: 1000)) – Size of the nodes.
node_alpha (float, optional (default: 0.9)) – Transparency of the nodes.
node_label_size (int, optional (default: 12)) – Font size of the node labels.
node_label_alpha (float, optional (default: 0.7)) – Transparency of the node labels.
node_label_offset (tuple, optional (default: (0.05, -0.2))) – Offset of the node labels.
title (str, optional (default: None)) – Title of the plot.
title_fontsize (int, optional (default: 14)) – Font size of the title.
ax (matplotlib.axes.Axes, optional (default: None)) – Axes object where the plot will be drawn. If None, a new figure is created.
save (str, optional (default: None)) – Filepath to save the plot. If None, the plot is not saved.
dpi (int, optional (default: 300)) – Resolution of the saved plot.
tight_layout (bool, optional (default: True)) – Whether to use tight layout for the plot.
bbox_inches (str, optional (default: 'tight')) – Bounding box in inches. Only used if save is provided.
- Returns:
fig (matplotlib.figure.Figure) – The matplotlib figure object.
ax (matplotlib.axes.Axes) – The matplotlib axes object.
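The effect of score_threshold can be illustrated without any plotting: only sender-receiver pairs whose score lies above the threshold become edges. This sketch uses plain dictionaries instead of a DataFrame. filter_edges, the column names, and the cell labels are hypothetical, and the strict comparison follows the wording "above this threshold".

```python
def filter_edges(rows, sender_col, receiver_col, score_col, score_threshold=None):
    """Hypothetical helper: keep (sender, receiver, score) triples above the threshold."""
    edges = []
    for row in rows:
        score = row[score_col]
        if score_threshold is not None and not score > score_threshold:
            continue  # drop weak interactions
        edges.append((row[sender_col], row[receiver_col], score))
    return edges


# Made-up communication scores between cell types.
ccc_rows = [
    {"sender": "T cell", "receiver": "B cell", "score": 0.8},
    {"sender": "B cell", "receiver": "T cell", "score": 0.2},
    {"sender": "Monocyte", "receiver": "T cell", "score": 0.5},
]
all_edges = filter_edges(ccc_rows, "sender", "receiver", "score")
strong_edges = filter_edges(ccc_rows, "sender", "receiver", "score",
                            score_threshold=0.4)
```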
- sccellfie.plotting.create_volcano_plot(de_results, effect_threshold=0.75, padj_threshold=0.05, cell_type=None, group1=None, group2=None, effect_col='cohens_d', effect_title="Cohen's d", wrapped_title_length=50, save=None, dpi=300, tight_layout=True)[source]
Creates a volcano plot for differential analysis results.
- Parameters:
de_results (pd.DataFrame) – A DataFrame containing the results of the differential analysis. Required columns: ‘feature’, ‘adj_p_value’, and the column specified in effect_col. Optional columns: ‘cell_type’, ‘group1’, ‘group2’.
effect_threshold (float, optional (default: 0.75)) – The threshold for the effect size (e.g., log2 fold change or Cohen’s d) to consider a variable significant.
padj_threshold (float, optional (default: 0.05)) – The threshold for the adjusted p-value to consider a variable significant.
cell_type (str, optional (default: None)) – The specific cell type to plot. If None and cell_type column exists, all cell types are plotted.
group1 (str, optional (default: None)) – The first group in the comparison. If None, all group1 values are included.
group2 (str, optional (default: None)) – The second group in the comparison. If None, all group2 values are included.
effect_col (str, optional (default: 'cohens_d')) – The column in de_results that contains the effect size values.
effect_title (str, optional (default: "Cohen's d")) – The title to use for the effect size in the plot.
wrapped_title_length (int, optional (default: 50)) – The maximum number of characters per line in the title.
save (str, optional (default: None)) – The file path to save the plot. If None, the plot is not saved. A file extension (e.g., ‘.png’) can be provided to specify the file format.
dpi (int, optional (default: 300)) – The resolution of the saved figure.
tight_layout (bool, optional (default: True)) – Whether to use tight layout for the plot.
- Returns:
A list of feature names that are considered significant based on the provided thresholds, sorted by effect size in ascending order. Returns an empty list if no significant features are found.
- Return type:
list
Notes
This function creates a volcano plot where:
- x-axis represents the effect size (e.g., log2 fold change or Cohen’s d)
- y-axis represents the -log10(adjusted p-value)
- Gray points indicate non-significant features
- Red points indicate significant features that pass both thresholds
- Dashed lines indicate the significance thresholds
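The list returned by create_volcano_plot can be approximated by the filter below. It is a sketch under assumptions, not the library's code: it assumes the effect threshold applies to the absolute effect size, that the effect comparison is inclusive and the p-value comparison strict, and it uses made-up feature names. significant_features is a hypothetical helper.

```python
def significant_features(rows, effect_threshold=0.75, padj_threshold=0.05,
                         effect_col="cohens_d"):
    """Hypothetical helper: features passing both thresholds, sorted by effect size."""
    hits = [
        row for row in rows
        # Assumption: both strongly negative and strongly positive effects count.
        if abs(row[effect_col]) >= effect_threshold
        and row["adj_p_value"] < padj_threshold
    ]
    hits.sort(key=lambda row: row[effect_col])  # ascending effect size
    return [row["feature"] for row in hits]


# Made-up differential analysis results.
de_rows = [
    {"feature": "Glycolysis", "cohens_d": 1.2, "adj_p_value": 0.001},
    {"feature": "Fatty acid oxidation", "cohens_d": -0.9, "adj_p_value": 0.01},
    {"feature": "Urea cycle", "cohens_d": 0.3, "adj_p_value": 0.001},
    {"feature": "Heme synthesis", "cohens_d": 1.5, "adj_p_value": 0.2},
]
significant = significant_features(de_rows)
```

Here "Urea cycle" fails the effect threshold and "Heme synthesis" fails the adjusted p-value threshold, so only two features remain.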
- sccellfie.plotting.create_comparative_violin(adata, significant_features, group1, group2, condition_key, celltype, cell_type_key, xlabel='Feature', ylabel='Metabolic Activity', title=None, wrapped_title_length=50, figsize=(16, 7), fontsize=10, violin_cut=0, palette=['coral', 'lightsteelblue'], lgd_bbox_to_anchor=(1.05, 1), lgd_loc='upper left', save=None, dpi=300, tight_layout=True)[source]
Compares features between two groups for a specific cell type in an AnnData object and creates a violin plot.
- Parameters:
adata (AnnData) – An AnnData object containing the data.
significant_features (list) – List of significant feature names from the volcano plot function, sorted by effect size.
group1 (str) – The name of the first group to compare.
group2 (str) – The name of the second group to compare.
condition_key (str) – The column name in adata.obs containing the condition information.
celltype (str) – The cell type to analyze.
cell_type_key (str) – The column name in adata.obs containing the cell type information.
xlabel (str, optional (default: 'Feature')) – The label for the x-axis.
ylabel (str, optional (default: 'Metabolic Activity')) – The label for the y-axis.
title (str, optional (default: None)) – The title for the plot. If None, a default title is used.
wrapped_title_length (int, optional (default: 50)) – The maximum number of characters per line in the title.
figsize (tuple, optional (default: (16, 7))) – The figure size.
fontsize (int, optional (default: 10)) – The font size for the labels and legend.
violin_cut (float, optional (default: 0)) – The cut parameter for the violin plot. Distance, in units of bandwidth, to extend the density past extreme datapoints. Set to 0 to limit the violin within the data range.
palette (list, optional (default: ['coral', 'lightsteelblue'])) – The color palette for the plot. Each color corresponds to a group or condition.
lgd_bbox_to_anchor (tuple, optional (default: (1.05, 1))) – The bbox_to_anchor parameter for the legend.
lgd_loc (str, optional (default: 'upper left')) – The location of the legend.
save (str, optional (default: None)) – The file path to save the plot. If None, the plot is not saved.
dpi (int, optional (default: 300)) – The resolution of the saved figure.
tight_layout (bool, optional (default: True)) – Whether to use tight layout for the plot.
- Returns:
fig, ax – The matplotlib Figure and Axes objects for the plot.
- Return type:
matplotlib.figure.Figure, matplotlib.axes.Axes
- sccellfie.plotting.create_beeswarm_plot(df, x='log2FC', y='cell_type', cohen_threshold=0.5, pval_threshold=0.05, show_n_significant=True, logfc_threshold=1.0, title=None, title_fontsize=20, ticks_fontsize=14, labels_fontsize=16, condition1_color='#8B0000', condition2_color='#000080', ns_color='#808080', strip_size=4, strip_alpha=0.6, strip_jitter=0.2, lgd_fontsize=14, lgd_marker_size=12, lgd_frameon=False, lgd_loc='upper left', lgd_bbox_to_anchor=(1.1, 1), sort_lambda=None, figsize=(10, 12), save=None, dpi=300, tight_layout=True)[source]
Creates a beeswarm plot to visualize differential analysis results. X-axis represents the effect size (e.g., log2 fold change or Cohen’s d). Y-axis represents the cell types or any categorical variable.
- Parameters:
df (DataFrame) – A DataFrame containing the results of the differential analysis, as produced by the function pairwise_differential_analysis. The DataFrame should have at least the following columns: ‘cell_type’, ‘feature’, ‘group1’, ‘group2’, ‘log2FC’, ‘cohens_d’, ‘adj_p_value’.
x (str, optional (default: 'log2FC')) – The column in df to use as the x-axis.
y (str, optional (default: 'cell_type')) – The column in df to use as the y-axis.
cohen_threshold (float, optional (default: 0.5)) – The threshold for Cohen’s d to consider a feature significant.
pval_threshold (float, optional (default: 0.05)) – The threshold for the adjusted p-value to consider a feature significant.
show_n_significant (bool, optional (default: True)) – Whether to show the count of significant features per cell type.
logfc_threshold (float, optional (default: 1.0)) – The threshold for the log2 fold change to consider a feature significant.
title (str, optional (default: None)) – The title for the plot. If None, a default title is used.
title_fontsize (int, optional (default: 20)) – The font size for the title.
ticks_fontsize (int, optional (default: 14)) – The font size for the ticks.
labels_fontsize (int, optional (default: 16)) – The font size for the labels.
condition1_color (str, optional (default: '#8B0000')) – The color for the first condition.
condition2_color (str, optional (default: '#000080')) – The color for the second condition.
ns_color (str, optional (default: '#808080')) – The color for non-significant features.
strip_size (int, optional (default: 4)) – The size of the strip plot points.
strip_alpha (float, optional (default: 0.6)) – The transparency of the strip plot points.
strip_jitter (float, optional (default: 0.2)) – The amount of jitter to apply to the strip plot points.
lgd_fontsize (int, optional (default: 14)) – The font size for the legend.
lgd_marker_size (int, optional (default: 12)) – The size of the legend markers.
lgd_frameon (bool, optional (default: False)) – Whether to show the legend frame.
lgd_loc (str, optional (default: 'upper left')) – The location of the legend.
lgd_bbox_to_anchor (tuple, optional (default: (1.1, 1))) – The bbox_to_anchor parameter for the legend.
sort_lambda (function, optional (default: None)) – A lambda function to sort the y-axis values. If None, the values are sorted by the y-axis column.
figsize (tuple, optional (default: (10, 12))) – The figure size.
save (str, optional (default: None)) – The file path to save the plot. If None, the plot is not saved.
dpi (int, optional (default: 300)) – The resolution of the saved figure.
tight_layout (bool, optional (default: True)) – Whether to use tight layout for the plot.
- Returns:
fig, ax (matplotlib.figure.Figure, matplotlib.axes.Axes) – The matplotlib Figure and Axes objects for the plot.
sig_df (DataFrame) – The input DataFrame filtered to only include significant features. Index is set to ‘cell_type’ and ‘feature’.
- sccellfie.plotting.create_multi_violin_plots(adata, features, groupby, n_cols=4, figsize=(5, 5), ylabel=None, title=None, fontsize=10, rotation=90, wrapped_title_length=45, save=None, dpi=300, tight_layout=True, w_pad=None, h_pad=None, **kwargs)[source]
Plots a grid of violin plots for multiple genes in Scanpy, controlling the number of columns.
- Parameters:
adata (AnnData) – Annotated data matrix.
features (list of str) – List of feature names to plot. Should match names in adata.var_names.
groupby (str) – Key in adata.obs containing the groups to plot. For each unique value in this column, a violin plot will be generated.
n_cols (int, optional (default: 4)) – Number of columns in the grid.
figsize (tuple of float, optional (default: (5, 5))) – Size of each subplot in inches.
ylabel (str, optional (default: None)) – Label for the y-axis. If None, the label will be the variable name.
title (list of str, optional (default: None)) – List of labels for each feature. If None, the feature name will be used.
fontsize (int, optional (default: 10)) – Font size for the title and axis labels. The tick labels will be set to fontsize, while the title will be set to fontsize + 4. Ylabel will be set to fontsize + 2.
rotation (int, optional (default: 90)) – Rotation of the x-axis tick labels
wrapped_title_length (int, optional (default: 45)) – The maximum number of characters per line in the title.
save (str, optional (default: None)) – Filepath to save the figure. If not provided, the figure will be displayed.
dpi (int, optional (default: 300)) – Resolution of the saved figure.
tight_layout (bool, optional (default: True)) – Whether to use tight layout.
w_pad (float, optional (default: None)) – Width padding between subplots.
h_pad (float, optional (default: None)) – Height padding between subplots.
**kwargs (dict) – Additional arguments to pass to sc.pl.violin. For example, rotation can be used to rotate the x-axis labels.
- sccellfie.plotting.create_radial_plot(metabolic_df, task_info_df, cell_type=None, tissue=None, task_col='metabolic_task', category_col='System', value_col='scaled_trimean', tissue_col='tissue', cell_type_col='cell_type', figsize=(6, 6), title='Metabolic activities', palette='Dark2', title_fontsize=24, legend_fontsize=14, legend_loc='center left', legend_bbox_to_anchor=(1.1, 0.5), alpha_fill=0.25, alpha_bg=0.1, ylim=1.0, sort_by_value=False, ax=None, show_legend=True, save=None, dpi=300, bbox_inches='tight', tight_layout=True)[source]
Creates a radial plot of metabolic task activities grouped by category.
- Parameters:
metabolic_df (pandas.DataFrame) – DataFrame containing metabolic task activities. Typically, it corresponds to the ‘melted’ dataframe in the outputs from sccellfie.reports.summary.generate_report_from_adata(). Required columns: task_col, value_col, cell_type_col, tissue_col.
task_info_df (pandas.DataFrame) – DataFrame containing task categorization information. Required columns: task_col and category_col.
cell_type (str, optional (default: None)) – The specific cell type to plot. If None, the maximum activity across all cell types within the specified tissue is used.
tissue (str, optional (default: None)) – The specific tissue to plot. If None, all tissues are included.
task_col (str, optional (default: 'metabolic_task')) – The column name in metabolic_df containing task identifiers.
category_col (str, optional (default: 'System')) – The column name in task_info_df containing category information.
value_col (str, optional (default: 'scaled_trimean')) – The column name in metabolic_df containing activity values.
tissue_col (str, optional (default: 'tissue')) – The column name in metabolic_df containing tissue information.
cell_type_col (str, optional (default: 'cell_type')) – The column name in metabolic_df containing cell type information.
figsize (tuple, optional (default: (6, 6))) – The size of the figure. Only used if ax is None.
title (str, optional (default: 'Metabolic activities')) – The title for the plot. Set to None to disable the title.
palette (str, optional (default: 'Dark2')) – Name of a palette for coloring the categories of metabolic tasks.
title_fontsize (int, optional (default: 24)) – Font size for the title.
legend_fontsize (int, optional (default: 14)) – Font size for the legend.
legend_loc (str, optional (default: "center left")) – Location of the legend.
legend_bbox_to_anchor (tuple, optional (default: (1.1, 0.5))) – Position of the legend relative to the legend_loc.
alpha_fill (float, optional (default: 0.25)) – Alpha transparency for the filled areas.
alpha_bg (float, optional (default: 0.1)) – Alpha transparency for the background areas.
ylim (float, optional (default: 1.0)) – Limit value for the y-axis (radial direction). If None, the maximum value across all tasks is used instead.
sort_by_value (bool, optional (default: False)) – If True, tasks within each category are sorted by their value. If False, tasks are sorted alphabetically within each category.
ax (matplotlib.axes.Axes, optional (default: None)) – A matplotlib axes with polar projection to draw the plot on. If None, a new figure and axes are created.
show_legend (bool, optional (default: True)) – Whether to display the legend.
save (str, optional (default: None)) – The filepath to save the figure. If None, the figure is not saved.
dpi (int, optional (default: 300)) – The resolution of the saved figure.
bbox_inches (str, optional (default: 'tight')) – The bbox_inches parameter for saving the figure.
tight_layout (bool, optional (default: True)) – Whether to use tight layout for the plot. Only applied if ax is None.
- Returns:
fig (matplotlib.figure.Figure) – The matplotlib figure object.
ax (matplotlib.axes.Axes) – The matplotlib axes object.
Examples
>>> import matplotlib.pyplot as plt
>>> import pandas as pd
>>> from sccellfie.plotting import create_radial_plot
>>>
>>> # Load example data
>>> metabolic_df = pd.read_csv('Melted.csv')
>>> task_info_df = pd.read_csv('TaskInfo.csv')
>>>
>>> # Create radial plot for maximum activities across all cell types in a tissue
>>> fig, ax = create_radial_plot(metabolic_df, task_info_df, tissue='Blood')
>>> plt.show()
>>>
>>> # Create radial plot for a specific cell type in a specific tissue
>>> fig, ax = create_radial_plot(metabolic_df, task_info_df, cell_type='T cell', tissue='Blood')
>>> plt.show()
>>>
>>> # Create multiple subplots with shared legend
>>> fig = plt.figure(figsize=(20, 10))
>>> ax1 = fig.add_subplot(121, projection='polar')
>>> ax2 = fig.add_subplot(122, projection='polar')
>>>
>>> # First subplot with legend
>>> create_radial_plot(metabolic_df, task_info_df, tissue='Blood', ax=ax1, show_legend=True)
>>> # Second subplot without legend
>>> create_radial_plot(metabolic_df, task_info_df, tissue='Liver', ax=ax2, show_legend=False)
>>> plt.tight_layout()
>>> plt.show()
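When cell_type is None, the documentation states that the maximum activity across all cell types within the tissue is plotted. The sketch below shows that aggregation on plain dictionaries. It illustrates the rule only and is not the function's code: max_activity_per_task and the example values are hypothetical.

```python
def max_activity_per_task(rows, tissue=None, task_col="metabolic_task",
                          value_col="scaled_trimean", tissue_col="tissue"):
    """Hypothetical helper: per-task maximum over cell types, optionally for one tissue."""
    best = {}
    for row in rows:
        if tissue is not None and row[tissue_col] != tissue:
            continue
        task, value = row[task_col], row[value_col]
        # Keep the largest value seen so far for this task.
        best[task] = max(value, best.get(task, value))
    return best


# Made-up rows in the shape of the 'melted' dataframe.
melted_rows = [
    {"metabolic_task": "Glycolysis", "cell_type": "T cell",
     "tissue": "Blood", "scaled_trimean": 0.4},
    {"metabolic_task": "Glycolysis", "cell_type": "Monocyte",
     "tissue": "Blood", "scaled_trimean": 0.8},
    {"metabolic_task": "Glycolysis", "cell_type": "Hepatocyte",
     "tissue": "Liver", "scaled_trimean": 0.95},
    {"metabolic_task": "Urea cycle", "cell_type": "T cell",
     "tissue": "Blood", "scaled_trimean": 0.1},
]
blood_max = max_activity_per_task(melted_rows, tissue="Blood")
```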
- sccellfie.plotting.plot_neighbor_distribution(results, figsize=(15, 8), save=None, dpi=300, bbox_inches='tight', tight_layout=True)[source]
Visualizes the neighbor distribution analysis results.
- Parameters:
results (dict) – Output from the sccellfie.spatial.neighborhood.compute_neighbor_distribution function.
figsize (tuple, optional (default: (15, 8))) – Figure size for the combined plots.
save (str, optional (default: None)) – Filepath to save the figure.
dpi (int, optional (default: 300)) – Resolution of the saved figure.
bbox_inches (str, optional (default: 'tight')) – Bounding box in inches. Only used if save is provided.
tight_layout (bool, optional (default: True)) – Whether to use tight layout.
- Returns:
fig (matplotlib.figure.Figure) – The matplotlib figure object.
gs (matplotlib.gridspec.GridSpec) – The matplotlib gridspec object.
- sccellfie.plotting.plot_spatial(adata, keys, suptitle=None, suptitle_fontsize=20, title_fontsize=14, legend_fontsize=12, bkgd_label='H&E', wrapped_title_length=45, ncols=3, hspace=0.15, wspace=0.1, save=None, dpi=300, bbox_inches='tight', tight_layout=True, **kwargs)[source]
Plots spatial expression of multiple genes in Scanpy.
- Parameters:
adata (AnnData) – AnnData object containing gene expression and spatial information.
keys (list of str) – List of feature names to plot. Should match names in adata.var_names or a column in adata.obs.
suptitle (str, optional (default: None)) – Title for the entire figure.
suptitle_fontsize (int, optional (default: 20)) – Font size for the figure title.
title_fontsize (int, optional (default: 14)) – Font size for each subplot title (key name).
legend_fontsize (int, optional (default: 12)) – Font size for the legend elements.
hspace (float, optional (default: 0.15)) – Height space between subplots.
wspace (float, optional (default: 0.1)) – Width space between subplots.
bkgd_label (str, optional (default: 'H&E')) – Label for the background image.
wrapped_title_length (int, optional (default: 45)) – The maximum number of characters per line in the title.
ncols (int, optional (default: 3)) – Number of columns in the grid.
save (str, optional (default: None)) – Filepath to save the figure.
dpi (int, optional (default: 300)) – Resolution of the saved figure. Only used if save is provided.
bbox_inches (str, optional (default: 'tight')) – Bounding box in inches. Only used if save is provided.
tight_layout (bool, optional (default: True)) – Whether to use tight layout.
**kwargs (dict) – Additional arguments to pass to scanpy.pl.spatial.
- Returns:
fig (matplotlib.figure.Figure) – The matplotlib figure object.
axes (numpy.ndarray) – Array of matplotlib axes.
- sccellfie.plotting.plot_segmentation(adata, spatial_key: str = 'X_spatial', color_by: str | Sequence[str] | None = None, celltype_key: str = 'cell_type', segmentation: dict | None = None, cell_id_col: str | None = None, palette: dict | None = None, highlight: List[str] | None = None, layer: str | None = None, crop: Tuple[float, float, float, float] | None = None, invert_yaxis: bool = True, legend: bool = True, legend_loc: str = 'center left', legend_bbox: Tuple[float, float] | None = (1.01, 0.5), legend_frameon: bool = False, legend_title: str | None = None, legend_fontsize: float | None = 7.0, legend_ncol: int = 1, legend_params: dict | None = None, axes_off: bool = True, figsize: Tuple[float, float] | None = None, ax=None, ncols: int = 4, panel_titles: bool = True, title: str | Sequence[str] | None = None, title_fontsize: float | None = 12, wrapped_title_length: int = 45, dpi: int = 150, scatter_size: float = 2.0, cmap: str = 'viridis', vmin: float | None = None, vmax: float | None = None, y_pad_ratio: float = 0.1, x_pad_ratio: float = 0.0, scalebar: bool = True, scalebar_kwargs: dict | None = None, cbar_kwargs: dict | None = None, save: str | None = None)[source]
Plot cell-resolution spatial data from an AnnData object.
Renders cells as segmentation polygons when segmentation is provided, otherwise as a centroid scatter plot. Supports categorical and continuous colouring, optional highlighting of a subset of categories, axis cropping, and a scalebar with bottom/top padding.
When color_by is a list, multiple panels are drawn in a grid laid out by ncols (matching sc.pl.spatial semantics): the geometry, crop, and view limits are computed once and shared across panels; each panel is coloured independently and gets its own legend or colorbar.
- Parameters:
adata (anndata.AnnData) – AnnData with spatial coordinates in
adata.obsm[spatial_key].spatial_key (str, optional (default: "X_spatial")) – Key in
adata.obsmfor the(n_cells, 2+)coordinate array. Defaults to scCellFie’s canonical key.color_by (str, list of str, or None, optional (default: None)) – Column in
adata.obsor name inadata.var_namesto colour by. If None, falls back tocelltype_key. Pass a list of names (e.g.["task_A", "task_B", "GENE1"]) to render a multi-panel figure with one panel per feature.celltype_key (str, optional (default: "cell_type")) – Default categorical column used when
color_byis None.segmentation (dict, optional (default: None)) – Mapping
cell_id -> shapely.Polygon(e.g. output ofsccellfie.io.load_segmentation()withoutput="dict"). If None, a scatter of centroids is drawn.cell_id_col (str, optional (default: None)) – Column in
adata.obsidentifying cells. Defaults toadata.obs.index.palette (dict, optional (default: None)) – Custom
{category: color}mapping for categorical colouring. Falls back toadata.uns["{color_by}_colors"]or matplotlibSet2cycling. In multi-panel mode the same palette is reused for every categorical feature.highlight (list of str, optional (default: None)) – Subset of categories to highlight; all others are drawn in
whitesmokeand excluded from the legend.layer (str, optional (default: None)) – Layer name in
adata.layersused whencolor_byis a gene. If None, usesadata.X.crop (tuple, optional (default: None)) –
(minx, miny, maxx, maxy)bounds to restrict the view. Data outside this box is not rendered. If None, uses data extent.invert_yaxis (bool, optional (default: True)) – Invert the y-axis (microscopy convention).
legend (bool, optional (default: True)) – Show the legend for categorical data, or a colorbar for continuous data.
legend_loc (str, optional (default: "center left")) – loc argument passed to ax.legend(). Ignored for colorbar.
legend_bbox (tuple, optional (default: (1.01, 0.5))) – bbox_to_anchor for the legend. Use None to disable the anchor and rely on legend_loc alone.
legend_frameon (bool, optional (default: False)) – Whether the legend frame/border is drawn.
legend_title (str, optional (default: None)) – Title shown above the legend entries.
legend_fontsize (float, optional (default: 7.0)) – Font size for legend labels. None falls back to the matplotlib default. The small default suits spatial plots with many categories; bump it via legend_params={'fontsize': 10} (or the dedicated arg) when needed.
legend_ncol (int, optional (default: 1)) – Number of columns in the legend.
legend_params (dict, optional (default: None)) – Arbitrary kwargs forwarded to ax.legend(...) (e.g. handlelength, labelspacing, borderpad, columnspacing). Keys here override the dedicated legend_* arguments on conflict.
axes_off (bool, optional (default: True)) – Remove ticks, tick labels, and spines (standard for spatial plots).
figsize (tuple, optional (default: None)) –
Single panel (color_by is a str or None): the figure size, defaulting to (10, 10) when None.
Multi panel (color_by is a list): the per-panel size, defaulting to (4, 4) when None. The total figure size is (figsize[0] * ncols, figsize[1] * nrows).
Ignored when ax is provided.
ax (matplotlib.axes.Axes, optional (default: None)) – Existing axes to draw onto. Only valid when color_by is a single feature (or None). For multi-panel, omit ax and let the function build the grid.
ncols (int, optional (default: 4)) – Number of columns in the panel grid when color_by is a list. Number of rows is ceil(len(color_by) / ncols). Mirrors sc.pl.spatial’s ncols parameter.
panel_titles (bool, optional (default: True)) – Master toggle for panel titles. When True, each panel’s title is set to the corresponding feature name (or to the explicit string passed via title=). Set False to suppress titles entirely (in single- and multi-panel modes).
title (str, list of str, or None, optional (default: None)) – Explicit title override. For single-feature mode pass a string; for multi-feature mode pass a list of strings whose length matches color_by. When None (default), titles are auto-derived from the feature names. Ignored if panel_titles=False.
title_fontsize (float, optional (default: 12)) – Font size of the per-panel title. Mirrors the convention in sccellfie.plotting.plot_spatial().
wrapped_title_length (int, optional (default: 45)) – Maximum number of characters per title line. Long feature names (e.g. metabolic-task labels) are wrapped via textwrap.wrap() before being set, matching the behavior of the other tool plots (plot_spatial(), create_multi_violin_plots(), create_volcano_plot()). Pass a large value (e.g. 1000) to disable wrapping.
dpi (int, optional (default: 150)) – Figure DPI; also used when save is set.
scatter_size (float, optional (default: 2.0)) – Marker size for centroid scatter mode.
cmap (str, optional (default: "viridis")) – Matplotlib colormap name for continuous colouring.
vmin (float, optional (default: None)) – Lower bound for continuous colouring. When set, values outside [vmin, vmax] are clipped at the colormap edges and the colorbar is restricted to that range. Ignored for categorical colouring. In multi-panel mode the same bounds apply to every panel (useful for comparing features on a shared scale). Pass only one of the two to cap a single side.
vmax (float, optional (default: None)) – Upper bound for continuous colouring. Behaves as described for vmin.
y_pad_ratio (float, optional (default: 0.1)) – Fraction of the y range added as top/bottom whitespace (so the scalebar label has room).
x_pad_ratio (float, optional (default: 0.0)) – Fraction of the x range added as left/right whitespace. Default keeps x tight to data — increase when the legend or a colorbar sits to the right of the plot and you want extra breathing room on the data side too.
scalebar (bool, optional (default: True)) – Draw a scalebar on every panel.
scalebar_kwargs (dict, optional (default: None)) – Overrides for the scalebar (e.g. length, units, color, position, pad_frac, fontsize, text_pad_pts). pad_frac is the inset of the bar from the axes corner as a fraction of the axes height/width; text_pad_pts is the gap (in points) between the bar and its label. The label is always placed on the side of the bar away from the data, so it never overlaps cells when y_pad_ratio > 0.
cbar_kwargs (dict, optional (default: None)) – Overrides passed to plt.colorbar for continuous colouring.
save (str, optional (default: None)) – If given, save the figure to this path with dpi and bbox_inches="tight".
- Returns:
fig (matplotlib.figure.Figure) – The matplotlib figure object.
ax (matplotlib.axes.Axes or numpy.ndarray of Axes) – Single Axes when color_by is a string (or None); a 2D array of Axes (shape (nrows, ncols)) when color_by is a list.
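The multi-panel sizing rules stated in the parameters above can be checked with a small standalone sketch. It only reproduces the documented arithmetic (rows from ceil(len(color_by) / ncols), total size from the per-panel figsize); the helper name panel_grid is hypothetical and is not part of scCellFie.

```python
import math


def panel_grid(n_features, ncols=4, figsize=None):
    """Return (nrows, total_figsize) for a multi-panel layout.

    Hypothetical helper: mirrors the documented rules only, it does not
    call into scCellFie or matplotlib.
    """
    # Per-panel size defaults to (4, 4) in multi-panel mode.
    per_panel = figsize if figsize is not None else (4, 4)
    # Number of rows is ceil(len(color_by) / ncols).
    nrows = math.ceil(n_features / ncols)
    # Total figure size is (figsize[0] * ncols, figsize[1] * nrows).
    total = (per_panel[0] * ncols, per_panel[1] * nrows)
    return nrows, total


# Six features in the default 4-column grid need two rows.
nrows, total = panel_grid(n_features=6, ncols=4)
```

With six features and the defaults this gives two rows and a 16 x 8 figure, so a long feature list grows the figure vertically while the width stays fixed by ncols.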
Preprocessing
- sccellfie.preprocessing.get_adata_gene_expression(adata, gene, layer=None, use_raw=False)[source]
Get expression values for a given feature from AnnData object. Checks both adata.var_names (gene expression) and adata.obs (metadata).
- Parameters:
adata (AnnData) – AnnData object containing the expression data.
gene (str) – Name of the gene or feature to extract the expression values for.
layer (str, optional (default: None)) – Name of the layer to extract the expression values from. This layer has priority over adata.X and use_raw.
use_raw (bool, optional (default: False)) – If True, use the raw data in adata.raw.X if available.
- Returns:
expression – Array containing the expression values for the specified gene.
- Return type:
- sccellfie.preprocessing.stratified_subsample_adata(adata, group_column, target_fraction=0.2, random_state=0)[source]
Stratified subsampling of an AnnData object.
- Parameters:
- Returns:
adata_subsampled – Subsampled AnnData object
- Return type:
AnnData
- sccellfie.preprocessing.normalize_adata(adata, target_sum=10000, n_counts_key='n_counts', chunk_size=None, copy=False)[source]
Memory-efficient normalization of AnnData object. Works directly on sparse matrices without converting to dense.
- Parameters:
adata (AnnData) – Annotated data matrix containing the expression data.
target_sum (int, optional (default: 10_000)) – The target sum to which the data will be normalized.
n_counts_key (str, optional (default: 'n_counts')) – The key in adata.obs containing the total counts for each cell.
chunk_size (int or None, optional (default: None)) – If None, process entire matrix at once (faster, more memory). If int, process matrix in chunks of this size (slower, less memory). Recommended for very large datasets (>1M cells).
copy (bool, optional (default: False)) – If True, returns a copy of adata with the normalized data.
- sccellfie.preprocessing.transform_adata_gene_names(adata, filename=None, organism='human', copy=True, drop_unmapped=False)[source]
Transforms gene names in an AnnData object from Ensembl IDs to gene symbols.
- Parameters:
adata (AnnData) – Annotated data matrix containing the expression data. All gene names must be in Ensembl ID format.
filename (str, optional) – The file path to a custom CSV file containing Ensembl IDs and gene symbols. One column must be ‘ensembl_id’ and the other ‘symbol’.
organism (str, optional (default: 'human')) – The organism to retrieve data for. Choose ‘human’ or ‘mouse’.
copy (bool, optional (default: True)) – If True, return a copy of the AnnData object. If False, modify the object in place.
drop_unmapped (bool, optional (default: False)) – If True, drop genes that could not be mapped to symbols.
- Returns:
The AnnData object with gene names transformed to gene symbols. If copy=True, this is a new object.
- Return type:
AnnData
- Raises:
ValueError – If not all genes in the AnnData object are in Ensembl ID format.
- sccellfie.preprocessing.transfer_variables(adata_target, adata_source, var_names, source_obs_col=None, target_obs_col=None, keep_sparse=True)[source]
Transfers variables from source AnnData to target AnnData, handling different sizes and maintaining sparse matrix format if needed.
- Parameters:
adata_target (AnnData) – Target AnnData object to add variables to.
adata_source (AnnData) – Source AnnData object to get variables from.
var_names (str or list) – Names of variables to transfer from adata_source to adata_target.
source_obs_col (str, optional) – Column in source adata.obs to use for matching observations (e.g. column containing barcodes).
target_obs_col (str, optional) – Column in target adata.obs to use for matching observations (e.g. column containing barcodes).
keep_sparse (bool) – Whether to maintain sparse matrix format if present.
- Returns:
Updated target AnnData object with new variables
- Return type:
AnnData
- sccellfie.preprocessing.add_complexes_to_adata(adata, complexes, agg_method='min', layer=None, copy=False)[source]
Adds multi-gene complex expression as new variables in an AnnData object.
Computes per-cell aggregated expression for each complex and appends the result as new columns in adata.X. Layers are handled by computing the complex aggregation for the source layer and zero-filling others.
- Parameters:
adata (AnnData) – AnnData object containing expression data with individual gene expression.
complexes (dict) – Dictionary mapping complex names (str) to lists of subunit gene names. Example: {‘ITGA4&ITGB1’: [‘ITGA4’, ‘ITGB1’]}
agg_method (str, default='min') –
- Aggregation across subunits per cell. Options:
‘min’ : Minimum expression (rate-limiting subunit).
‘mean’ : Arithmetic mean expression.
‘gmean’ : Geometric mean expression.
layer (str, optional) – Layer to read subunit expression from. If None, uses adata.X.
copy (bool, default=False) – If True, return a modified copy. If False, modify adata in place and return None.
- Returns:
If copy=True, returns the modified AnnData. Otherwise modifies adata in place and returns None.
- Return type:
AnnData or None
- Raises:
ValueError – If agg_method is not one of ‘min’, ‘mean’, ‘gmean’. If any subunit gene is not found in adata.var_names. If a complex name already exists in adata.var_names.
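To make the three aggregation options concrete, here is a minimal sketch of how one cell's complex score could be derived from its subunit values. The function aggregate_subunits and the toy expression values are illustrative assumptions, not the library's implementation, which works on the AnnData matrix directly.

```python
import math


def aggregate_subunits(values, agg_method="min"):
    """Aggregate subunit expression values for one cell (illustrative only)."""
    if agg_method == "min":
        # The least expressed subunit limits the complex.
        return min(values)
    if agg_method == "mean":
        return sum(values) / len(values)
    if agg_method == "gmean":
        # Geometric mean: n-th root of the product; zero if any subunit is zero.
        return math.prod(values) ** (1.0 / len(values))
    raise ValueError("agg_method must be one of 'min', 'mean', 'gmean'")


# Toy per-cell expression for the subunits of 'ITGA4&ITGB1'.
cells = {
    "cell_1": {"ITGA4": 2.0, "ITGB1": 8.0},
    "cell_2": {"ITGA4": 0.0, "ITGB1": 3.0},
}
complex_scores = {
    cell: aggregate_subunits([expr["ITGA4"], expr["ITGB1"]], "min")
    for cell, expr in cells.items()
}
```

Both 'min' and 'gmean' return zero as soon as one subunit is unexpressed, whereas 'mean' still reports a positive value for cell_2; that difference is usually what drives the choice of agg_method.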
- sccellfie.preprocessing.make_complex_name(subunits, separator='&')[source]
Generates a canonical complex name from a list of subunit names.
- sccellfie.preprocessing.prepare_var_pairs(adata, var_pairs, complex_sep='&', agg_method='min', layer=None)[source]
Prepares variable pairs for communication scoring by detecting multi-element (complex) entries, adding them to adata, and returning normalized string-only pairs.
Each element in a var_pair can be either a string (single gene/task) or a list/tuple of strings (complex with multiple subunits). When a list is detected, the complex is automatically named by joining the sorted subunit names with complex_sep and added to adata via add_complexes_to_adata. Complexes already present in adata.var_names are skipped.
- Parameters:
adata (AnnData) – AnnData object containing expression data.
var_pairs (list of tuples) –
- List of (ligand, receptor) pairs where each element can be:
str: single gene or task name.
list/tuple of str: subunits of a complex.
Example:
var_pairs = [
    (['TASK1', 'TASK2'], ['GENE1', 'GENE2']),  # both complex
    ('TASK3', 'GENE4'),                        # both single
    ('TASK1', ['GENE5', 'GENE6']),             # mixed
]
complex_sep (str, default='&') – Separator used to join subunit names into the complex name.
agg_method (str, default='min') – Aggregation method for complex subunits. See add_complexes_to_adata.
layer (str, optional) – Layer to read subunit expression from.
- Returns:
normalized_pairs – String-only (ligand, receptor) pairs ready for scoring functions. Complex elements are replaced by their generated names.
- Return type:
list of tuples
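The naming step described above (list or tuple elements replaced by their sorted subunit names joined with complex_sep) can be sketched without AnnData. The helpers complex_name and normalize_pairs are hypothetical stand-ins, and the sketch skips the part where the complexes are added to adata.

```python
def complex_name(subunits, separator="&"):
    """Join sorted subunit names into one complex name (illustrative)."""
    return separator.join(sorted(subunits))


def normalize_pairs(var_pairs, complex_sep="&"):
    """Replace list/tuple elements with complex names; keep strings as-is."""

    def as_name(element):
        if isinstance(element, str):
            return element
        return complex_name(element, complex_sep)

    return [(as_name(ligand), as_name(receptor)) for ligand, receptor in var_pairs]


var_pairs = [
    (["TASK2", "TASK1"], ["GENE1", "GENE2"]),  # both complex
    ("TASK3", "GENE4"),                        # both single
    ("TASK1", ["GENE5", "GENE6"]),             # mixed
]
normalized_pairs = normalize_pairs(var_pairs)
```

Sorting before joining means ['TASK2', 'TASK1'] and ['TASK1', 'TASK2'] map to the same name, so the same complex is not added twice under different spellings.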
- sccellfie.preprocessing.get_element_associations(df, element, axis_element=0)[source]
Gets the tasks, reactions, or genes associated with a given element in the DataFrame.
- Parameters:
df (pandas.DataFrame) – DataFrame containing the associations.
element (str) – Element for which to get the associations. This can be a task, reaction, or gene. Name should match exactly the name in indexes or columns of the DataFrame.
axis_element (int, optional (default: 0)) – Axis along which the element is located. Can be 0 (rows) or 1 (columns).
- Returns:
associations – List of tasks, reactions, or genes associated with the given element.
- Return type:
- sccellfie.preprocessing.add_new_task(task_by_rxn, task_by_gene, rxn_by_gene, task_info, rxn_info, task_name, task_system, task_subsystem, rxn_names, gpr_hgncs, gpr_symbols)[source]
Adds a new task and its associated reactions and genes to the database.
- Parameters:
task_by_rxn (pandas.DataFrame) – DataFrame representing the relationship between tasks and reactions.
task_by_gene (pandas.DataFrame) – DataFrame representing the relationship between tasks and genes.
rxn_by_gene (pandas.DataFrame) – DataFrame representing the relationship between reactions and genes.
task_info (pandas.DataFrame) – DataFrame containing information about tasks, including the task name, system (major group of tasks), and subsystem (specific group of tasks).
rxn_info (pandas.DataFrame) – DataFrame containing information about reactions, including the reaction name, and the associated GPR rules in HGNC and symbol format.
task_name (str) – Name of the task to add.
task_system (str) – System (major group of tasks) to which the task belongs.
task_subsystem (str) – Subsystem (specific group of tasks) to which the task belongs.
rxn_names (list of str) – List of reaction names associated with the task.
gpr_hgncs (list of str) – List of GPR rules in HGNC format associated with the reactions. Order should match the order of the reaction names.
gpr_symbols (list of str) – List of GPR rules in symbol format associated with the reactions. Order should match the order of the reaction names.
- Returns:
task_by_rxn (pandas.DataFrame) – Updated DataFrame representing the relationship between tasks and reactions.
task_by_gene (pandas.DataFrame) – Updated DataFrame representing the relationship between tasks and genes.
rxn_by_gene (pandas.DataFrame) – Updated DataFrame representing the relationship between reactions and genes.
task_info (pandas.DataFrame) – Updated DataFrame containing information about tasks, including the task name, system (major group of tasks), and subsystem (specific group of tasks).
rxn_info (pandas.DataFrame) – Updated DataFrame containing information about reactions, including the reaction name, and the associated GPR rules in HGNC and symbol format.
- sccellfie.preprocessing.combine_and_sort_dataframes(df1, df2, preference='max')[source]
Combines two DataFrames and sorts the rows and columns alphabetically.
- Parameters:
df1 (pandas.DataFrame) – First DataFrame to combine.
df2 (pandas.DataFrame) – Second DataFrame to combine.
preference (str, optional) – Preference for which value to keep when both dataframes have the same cell. Options: ‘max’ (default), ‘min’, ‘df1’, ‘df2’.
- Returns:
combined_df – Combined DataFrame with all rows and columns from df1 and df2, sorted alphabetically. Missing values are filled with 0.
- Return type:
- sccellfie.preprocessing.handle_duplicate_indexes(df, value_column=None, operation='first')[source]
Handles duplicated indexes in a DataFrame by keeping the min, max, mean, first, or last value associated with them in a specified column.
- Parameters:
df (pandas.DataFrame) – DataFrame with duplicated indexes.
value_column (str, optional (default: None)) – Name of the column containing values to make a decision when handling duplicated indexes. This value is optional only when operation is ‘first’ or ‘last’.
operation (str, optional (default: 'first')) – Operation to perform when handling duplicated indexes. Options: ‘min’, ‘max’, ‘mean’, ‘first’, ‘last’.
- Returns:
df_result – DataFrame with duplicated indexes handled according to the specified operation
- Return type:
- sccellfie.preprocessing.clean_gene_names(gpr_rule)[source]
Removes spaces between parentheses and gene IDs in a GPR rule.
- sccellfie.preprocessing.find_genes_gpr(gpr_rule)[source]
Finds all gene IDs in a GPR rule.
- sccellfie.preprocessing.replace_gene_ids_in_gpr(gpr_rule, gene_id_mapping)[source]
Replaces gene IDs in a GPR rule with new IDs (different nomenclature).
- sccellfie.preprocessing.convert_gpr_nomenclature(gpr_rules, id_mapping)[source]
Converts gene IDs in multiple GPR rules to a different nomenclature.
- sccellfie.preprocessing.get_matrix_gene_expression(matrix, var_names, gene, normalize=False)[source]
Safely extracts expression values for a gene from any matrix type.
- Parameters:
matrix (numpy.ndarray) – The matrix containing the expression data. Rows correspond to cells and columns to genes.
var_names (list or pandas.Index) – The index or array containing the gene names.
gene (str) – The gene name to extract.
normalize (bool, optional (default: False)) – If True, apply min-max normalization to the expression values.
- Returns:
expression – An array containing the expression values for the specified gene.
- Return type:
- sccellfie.preprocessing.min_max_normalization(df, axis=0)[source]
Applies min-max normalization along specified axis.
- Parameters:
df (pandas.DataFrame or array-like) – The input DataFrame to be normalized.
axis (int, optional (default: 0)) – The axis along which to normalize. Use 0 to normalize each column or 1 to normalize each row using their cognate min and max values.
- Returns:
df_scaled – A DataFrame containing the normalized values. Minimum and maximum values are calculated along the specified axis. Minimum and maximum values are 0 and 1, respectively. NaN values are filled with 0.
- Return type:
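As a rough illustration of the scaling described above, the following sketch applies min-max normalization to a single column held in a plain list. It assumes that NaN inputs and constant columns (where max equals min) both end up as 0, which is one reading of "NaN values are filled with 0"; the name min_max_scale is hypothetical.

```python
import math


def min_max_scale(values):
    """Scale one column to [0, 1]; NaN and constant columns map to 0."""
    finite = [v for v in values if not math.isnan(v)]
    if not finite:
        return [0.0 for _ in values]
    lo, hi = min(finite), max(finite)
    span = hi - lo
    scaled = []
    for v in values:
        if math.isnan(v) or span == 0:
            # Assumption: undefined results are filled with 0.
            scaled.append(0.0)
        else:
            scaled.append((v - lo) / span)
    return scaled


column = [2.0, 4.0, 6.0]
scaled_column = min_max_scale(column)
```

With axis=0 this calculation would be repeated per column, and with axis=1 per row.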
- sccellfie.preprocessing.compute_dataframes_correlation(df1, df2, col_name=None, method='spearman')[source]
Computes correlations between one column in df1 and all columns in another DataFrame, df2.
- Parameters:
df1 (pandas.DataFrame) – DataFrame of which one column will be correlated against multiple columns in df2.
df2 (pandas.DataFrame) – DataFrame containing multiple columns to correlate against the single column in df1.
col_name (str, optional (default: None)) – The name of the column in df1 to correlate against df2. If None, the first column in df1 is used.
method (str, optional (default: 'spearman')) – The correlation method to use. Either ‘pearson’ or ‘spearman’.
- Returns:
DataFrame with correlation coefficients for each column in df2.
- Return type:
- sccellfie.preprocessing.preprocess_inputs(adata, gpr_info, task_by_gene, rxn_by_gene, task_by_rxn, correction_organism='human', gene_fraction_threshold=0.0, reaction_fraction_threshold=0.0, verbose=True)[source]
Preprocesses inputs for metabolic analysis.
- Parameters:
adata (AnnData) – Annotated data matrix.
gpr_info (pandas.DataFrame) – DataFrame containing reaction IDs and their corresponding Gene-Protein-Reaction (GPR) rules.
task_by_gene (pandas.DataFrame) – DataFrame representing the relationship between tasks and genes.
rxn_by_gene (pandas.DataFrame) – DataFrame representing the relationship between reactions and genes.
task_by_rxn (pandas.DataFrame) – DataFrame representing the relationship between tasks and reactions.
correction_organism (str, optional (default: 'human')) – Organism of the input data. This is important to correct gene names that are present in scCellFie’s or custom database. Check options in sccellfie.preprocessing.prepare_inputs.CORRECT_GENES.keys()
gene_fraction_threshold (float, optional (default: 0.0)) – The minimum fraction of genes in a reaction’s GPR that must be present in adata to keep the reaction. Range is 0 to 1. 1.0 means all genes must be present. Any value > 0 and < 1 keeps reactions with at least that fraction of genes present. 0 means keep reactions with at least one gene present.
reaction_fraction_threshold (float, optional (default: 0.0)) – The minimum fraction of reactions in a task that must be present after gene filtering to keep the task. Range is 0 to 1. 1.0 means all reactions must be present. Any value > 0 and < 1 keeps tasks with at least that fraction of reactions present. 0 means keep tasks with at least one reaction present.
verbose (bool, optional (default: True)) – If True, prints information about the preprocessing results.
- Returns:
adata2 (AnnData) – Filtered annotated data matrix.
gpr_rules (dict) – Dictionary of GPR rules for the filtered reactions.
task_by_gene (pandas.DataFrame) – Filtered DataFrame representing the relationship between tasks and genes.
rxn_by_gene (pandas.DataFrame) – Filtered DataFrame representing the relationship between reactions and genes.
task_by_rxn (pandas.DataFrame) – Filtered DataFrame representing the relationship between tasks and reactions.
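The threshold semantics for gene_fraction_threshold (and, by analogy, reaction_fraction_threshold) can be expressed as a small predicate. This is a sketch of the documented rule only; keep_reaction and its arguments are illustrative names, and the real function operates on DataFrames and GPR rules.

```python
def keep_reaction(gpr_genes, dataset_genes, gene_fraction_threshold=0.0):
    """Decide whether a reaction survives gene filtering (illustrative)."""
    gpr_set = set(gpr_genes)
    present = len(gpr_set & set(dataset_genes))
    if present == 0:
        # Even with a threshold of 0, at least one gene must be present.
        return False
    fraction = present / len(gpr_set)
    return fraction >= gene_fraction_threshold


dataset_genes = {"A", "B"}
gpr_genes = ["A", "B", "C", "D"]  # half of the GPR genes are covered

kept_default = keep_reaction(gpr_genes, dataset_genes)      # threshold 0.0
kept_half = keep_reaction(gpr_genes, dataset_genes, 0.5)    # needs >= 50%
kept_all = keep_reaction(gpr_genes, dataset_genes, 1.0)     # needs 100%
```

The same predicate applied to reactions within a task, instead of genes within a GPR, gives the behaviour described for reaction_fraction_threshold.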
Reports
- sccellfie.reports.compute_dataset_completeness(adata, gpr_source, task_by_rxn, ablation_impact=None, reaction_impact=None, metric='fraction_zeroed', threshold=1.0, disable_pbar=True)[source]
Evaluate dataset completeness relative to a metabolic task database, at both essential-gene and all-gene scopes, in a single pass.
“Missing” at the dataset level means the gene symbol does not appear in adata.var_names, i.e. the assay does not cover it. This is a property of the dataset as a whole, not of any particular cell.
- Parameters:
adata (AnnData) – The user’s expression data; only adata.var_names is consulted.
gpr_source (dict) – Either {reaction_id: cobra.core.gene.GPR} (as returned by sccellfie.preprocessing.prepare_inputs.preprocess_inputs) or {reaction_id: str} of raw GPR strings.
task_by_rxn (pandas.DataFrame) – Rows are tasks, columns are reactions; non-zero where reaction participates in task.
ablation_impact (dict of DataFrames, optional) – Output of sccellfie.stats.compute_gene_ablation_impact at the task level. If None, it is computed internally.
reaction_impact (dict of DataFrames, optional) – Reaction-level ablation (each reaction treated as its own task). Computed internally when None.
metric (str, optional (default: 'fraction_zeroed')) – Impact metric used when deriving essential-gene sets via essential_genes_from_ablation. One of ‘rel_change’, ‘abs_change’, ‘fraction_zeroed’.
threshold (float, optional (default: 1.0)) – Threshold paired with metric for essential-gene derivation.
disable_pbar (bool, optional (default: True)) – Forwarded to internal compute_gene_ablation_impact calls when impacts are computed here.
- Returns:
- Three flat DataFrames with dual essential/all scopes as suffixed columns:
‘task_completeness’ : one row per task.
‘reaction_completeness’ : one row per reaction.
‘overall_summary’ : single row with aggregate stats.
- Return type:
- sccellfie.reports.compute_cell_completeness(adata, gpr_source, task_by_rxn, ablation_impact=None, metric='fraction_zeroed', threshold=1.0, layer=None, write_to_obs=True, obs_key_prefix='completeness_', return_matrix=False, disable_pbar=True)[source]
Per-cell completeness relative to the metabolic-task database, at both essential-gene and all-gene scopes.
“Missing” for a given cell and gene means either (a) the gene is absent from adata.var_names (dataset-absent, constant across cells) or (b) the gene is in adata.var_names but has expression == 0 in that cell. Missing genes contribute their rel_change impact on each task; per-cell per-task completeness is 1 - clip(sum of impacts, 0, 1). The final per-cell score aggregates across tasks via the mean.
- Parameters:
adata (AnnData) – Expression data. adata.X (or adata.layers[layer] if provided) is used to determine which genes are zero in which cells.
gpr_source (dict) – As in compute_dataset_completeness.
task_by_rxn (pandas.DataFrame) – Tasks x reactions membership matrix.
ablation_impact (dict of DataFrames, optional) – Task-level ablation output. Computed internally if None.
metric (str, optional (default: 'fraction_zeroed')) – See compute_dataset_completeness.
threshold (float, optional (default: 1.0)) – See compute_dataset_completeness.
layer (str, optional) – Layer in adata.layers from which to read expression. Defaults to adata.X.
write_to_obs (bool, optional (default: True)) – If True, writes adata.obs[obs_key_prefix + ‘essential’] and adata.obs[obs_key_prefix + ‘all’].
obs_key_prefix (str, optional (default: ‘completeness_’)) – Prefix for the obs columns when write_to_obs=True.
return_matrix (bool, optional (default: False)) – If True, also return the dense (cell x task) per-scope completeness matrices.
disable_pbar (bool, optional (default: True)) – Forwarded to internal compute_gene_ablation_impact call when impact is computed here.
- Returns:
‘per_cell’ : DataFrame(cells x [‘completeness_essential’, ‘completeness_all’])
‘matrix_essential’ : DataFrame(cells x tasks) or None
‘matrix_all’ : DataFrame(cells x tasks) or None
- Return type:
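The per-cell formula quoted above, 1 - clip(sum of impacts, 0, 1) per task followed by a mean across tasks, can be traced by hand with a toy example. The sketch below uses plain dictionaries, ignores the essential/all scope split, and the name cell_completeness is hypothetical.

```python
def cell_completeness(expression, rel_change):
    """Per-task and overall completeness for one cell (illustrative).

    expression: {gene: value} for genes present in the dataset.
    rel_change: {gene: {task: impact}} taken from an ablation analysis.
    """
    tasks = sorted({task for impacts in rel_change.values() for task in impacts})
    per_task = {}
    for task in tasks:
        # A gene is missing if it is absent from the dataset or has zero expression.
        lost = sum(
            impacts.get(task, 0.0)
            for gene, impacts in rel_change.items()
            if expression.get(gene, 0.0) == 0
        )
        per_task[task] = 1.0 - min(max(lost, 0.0), 1.0)
    score = sum(per_task.values()) / len(per_task) if per_task else float("nan")
    return per_task, score


rel_change = {
    "G1": {"T1": 1.0, "T2": 0.25},
    "G2": {"T1": 0.0, "T2": 0.5},
    "G3": {"T2": 0.5},
}
# G1 is expressed, G2 is zero in this cell, G3 is absent from the dataset.
per_task, score = cell_completeness({"G1": 3.0, "G2": 0.0}, rel_change)
```

Here T1 stays fully complete because neither missing gene affects it, T2 drops to 0 because the two missing genes together remove all of its score, and the cell-level value is the mean of the two.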
- sccellfie.reports.generate_completeness_report(adata, gpr_source, task_by_rxn, ablation_impact=None, reaction_impact=None, metric='fraction_zeroed', threshold=1.0, layer=None, write_to_obs=True, obs_key_prefix='completeness_', return_matrix=False, disable_pbar=True)[source]
Run both compute_dataset_completeness and compute_cell_completeness in one call and return {‘dataset’: …, ‘cell’: …}. Shares the same ablation impact across both sub-reports to avoid duplicate computation.
- sccellfie.reports.generate_report_from_adata(adata, group_by, agg_func='trimean', layer=None, features=None, tissue_col=None, feature_name='feature', min_cells=1, threshold=np.float64(3.4657359027997265), default_tissue_name='tissue', **kwargs)[source]
Process AnnData object and calculate metrics for each group (e.g., cell type).
- Parameters:
adata (AnnData) – AnnData object containing the expression data.
group_by (str) – Column name in adata.obs for the groups (e.g., cell types).
agg_func (str, optional (default: 'trimean')) – The aggregation function to apply. Options are ‘mean’, ‘median’, ‘25p’ (25th percentile), ‘75p’ (75th percentile), ‘trimean’ (0.5*Q2 + 0.25(Q1+Q3)), and ‘topmean’ (computed among the top `top_percent`% of values).
layer (str, optional (default: None)) – Name of the layer in adata to use. If None, uses adata.X.
features (list, optional (default: None)) – Names of features to analyze. If None, uses adata.var_names.
tissue_col (str, optional (default: None)) – Column name in adata.obs for tissue information.
feature_name (str, optional (default: 'feature')) – Name to use for features in melted results (e.g., ‘metabolic_task’, ‘reaction’).
min_cells (int, optional (default: 1)) – Minimum number of cells required for a group to be included in the analysis.
threshold (float, optional (default: 5*np.log(2))) – Threshold value for counting cells passing expression threshold.
default_tissue_name (str, optional (default: 'tissue')) – Default tissue name to use when tissue_col is not provided.
**kwargs (dict) – Additional arguments to pass to the aggregation function.
- Returns:
- Dictionary containing DataFrames for each metric:
agg_values: Aggregated values (e.g., trimean) per group
variance: Variance values per group
std: Standard deviation values per group
threshold_cells: Number of cells passing threshold per group
nonzero_cells: Number of non-zero cells per group
cell_counts: Number of cells per group
min_max: Min/max values for features
melted: Melted version of all metrics
- Return type:
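For reference, the 'trimean' option and the default threshold can be reproduced with the standard library. This sketch assumes quartiles computed by linear interpolation between order statistics (the "inclusive" method of statistics.quantiles) and assumes that "passing" means strictly greater than the threshold; neither detail is stated in this reference.

```python
import math
import statistics


def trimean(values):
    """Tukey's trimean: 0.5*Q2 + 0.25*(Q1 + Q3)."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return 0.5 * q2 + 0.25 * (q1 + q3)


scores = [1.0, 2.0, 3.0, 4.0, 100.0]  # one extreme value in the group

group_trimean = trimean(scores)
group_mean = statistics.mean(scores)

# Default threshold from the signature: 5 * ln(2).
default_threshold = 5 * math.log(2)
# Assumption: a cell passes when its value is strictly above the threshold.
n_passing = sum(value > default_threshold for value in scores)
```

The single outlier pulls the mean to 22 while the trimean stays at 3, which is the usual motivation for a quartile-based summary of per-group scores.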
Spatial
Stats
- sccellfie.stats.compute_gene_ablation_impact(gpr_source, task_by_rxn, genes=None, uniform_score=1.0, disable_pbar=False)[source]
Simulate single-gene ablation on a synthetic uniform-expression reference and measure per-task impact.
For each gene, set its gene_score to 0 (leaving every other gene at uniform_score), re-evaluate every reaction whose GPR contains the gene, then recompute metabolic-task scores using the same arithmetic as sccellfie.metabolic_task.compute_mt_score.
- Parameters:
gpr_source (dict) – Either {reaction_id: cobra.core.gene.GPR} (as returned by sccellfie.preprocessing.prepare_inputs.preprocess_inputs) or {reaction_id: str} of raw GPR strings (parsed internally via cobra.core.gene.GPR().from_string).
task_by_rxn (pandas.DataFrame) – Rows are metabolic tasks, columns are reactions. Cell (T, r) is non-zero iff reaction r participates in task T.
genes (list of str, optional (default: None)) – Subset of genes to ablate. Default uses the union of all genes across the GPRs. Genes not appearing in any GPR contribute an all-zero row.
uniform_score (float, optional (default: 1.0)) – Positive score assigned to every non-ablated gene. Exposed mainly for testing; rel_change and fraction_zeroed are scale-invariant, while abs_change scales linearly with this value.
disable_pbar (bool, optional (default: False)) – Disable the per-gene progress bar.
- Returns:
- Three (gene x task) DataFrames keyed by:
‘rel_change’ : (baseline_mts - ablated_mts) / baseline_mts, in [0, 1]. 1.0 means the gene fully zeros the task under the uniform reference.
‘abs_change’ : baseline_mts - ablated_mts.
‘fraction_zeroed’ : 1 iff ablated_mts == 0 and baseline_mts > 0, else 0.
- Return type:
Notes
Under a single-cell uniform reference every reaction’s baseline RAL equals 5*log(1 + uniform_score/uniform_score) = 5*log(2) when reached through compute_gene_scores, but here we call the GPR walker directly on gene scores (not raw expression), so baseline RAL and baseline MTS are both equal to uniform_score exactly (min/max of constant uniform_score values). This is a property of the walker, not of the gene_score transform.
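The ablation procedure can be illustrated on a toy network. The sketch below makes three assumptions that are not confirmed by this reference: GPRs are written as nested tuples rather than cobra GPR objects, AND is evaluated as min and OR as max (in line with the min/max remark in the Notes), and a task score is the mean of its reaction scores. All function names are hypothetical.

```python
def eval_gpr(rule, gene_scores):
    """Evaluate a nested GPR: a gene name or (op, child, child, ...)."""
    if isinstance(rule, str):
        return gene_scores[rule]
    op, *children = rule
    values = [eval_gpr(child, gene_scores) for child in children]
    # Assumption: AND -> min, OR -> max.
    return min(values) if op == "and" else max(values)


def genes_in(rule):
    """Collect every gene name that appears in a nested GPR."""
    if isinstance(rule, str):
        return {rule}
    return set().union(*(genes_in(child) for child in rule[1:]))


def task_score(task_reactions, gprs, gene_scores):
    # Assumption: task score is the mean of its reaction scores.
    scores = [eval_gpr(gprs[rxn], gene_scores) for rxn in task_reactions]
    return sum(scores) / len(scores)


def ablation_impact(gene, task_reactions, gprs, uniform_score=1.0):
    """Impact of zeroing one gene on one task under a uniform reference."""
    all_genes = set().union(*(genes_in(rule) for rule in gprs.values()))
    baseline_scores = {g: uniform_score for g in all_genes}
    ablated_scores = {**baseline_scores, gene: 0.0}
    baseline = task_score(task_reactions, gprs, baseline_scores)
    ablated = task_score(task_reactions, gprs, ablated_scores)
    return {
        "rel_change": (baseline - ablated) / baseline,
        "abs_change": baseline - ablated,
        "fraction_zeroed": int(ablated == 0 and baseline > 0),
    }


gprs = {"R1": ("and", "G1", "G2"), "R2": ("or", "G1", "G3")}
impact_g1 = ablation_impact("G1", ["R1", "R2"], gprs)
impact_g3 = ablation_impact("G3", ["R1", "R2"], gprs)
```

Removing G1 zeros R1 but leaves R2 rescued by G3, so the task loses half its score; removing G3 has no effect because G1 covers the OR branch. Rerunning with uniform_score=2.0 leaves rel_change unchanged and doubles abs_change, matching the scale behaviour described for uniform_score.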
- sccellfie.stats.compute_reaction_topology_essentiality(task_by_rxn, cobra_model, task_endpoints, treat_reversible_as_bidirectional=True, ignore_metabolites=None)[source]
For each task with a user-supplied (start_metabolite, end_metabolite), flag reactions that are essential for connecting start -> end through the task’s metabolite graph.
The graph has metabolites as nodes and reactions as edges. For each reaction in the task that is present in the cobra Model, an edge is added from every substrate to every product. When treat_reversible_as_bidirectional is True, reversible reactions also contribute reverse edges. Optionally, metabolites in ignore_metabolites are excluded from the graph (useful for currency metabolites like ATP/ADP/H+/H2O).
A reaction is essential iff removing all of its edges disconnects start_met from end_met. Tasks without a specified endpoint pair are skipped (their column in the output is all False).
- Parameters:
task_by_rxn (pandas.DataFrame) – Rows are tasks, columns are reactions, non-zero where the reaction participates in the task.
cobra_model (cobra.Model) – Genome-scale metabolic model whose reaction IDs and metabolite IDs match those used in task_by_rxn.
task_endpoints (dict[str, tuple[str, str]]) – {task_name: (start_metabolite_id, end_metabolite_id)}. Only tasks listed here are evaluated; others get an all-False column.
treat_reversible_as_bidirectional (bool, optional (default: True)) – If True, reactions with rxn.reversibility == True contribute edges in both directions. If False, edges follow the nominal substrate -> product direction only.
ignore_metabolites (set of str, optional (default: None)) – Metabolite IDs to exclude as graph nodes. Edges that would use any of them are not added.
- Returns:
(reactions x tasks) boolean DataFrame. True at (r, T) iff removing reaction r disconnects the start -> end path in task T. Rows are indexed by all reaction IDs in task_by_rxn.columns.
- Return type:
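The essentiality test described above is a reachability question on a directed metabolite graph. The following sketch represents each reaction as a plain dictionary instead of a cobra reaction, and it assumes that a task whose endpoints are not connected even with all reactions present yields all-False flags. Function names and the toy pathway are illustrative.

```python
from collections import deque


def reaction_edges(reaction, bidirectional=True, ignore=frozenset()):
    """Substrate -> product edges for one reaction, skipping ignored nodes."""
    edges = [
        (s, p)
        for s in reaction["substrates"]
        for p in reaction["products"]
        if s not in ignore and p not in ignore
    ]
    if bidirectional and reaction["reversible"]:
        edges += [(p, s) for s, p in edges]
    return edges


def reachable(edges, start, end):
    """Breadth-first search from start to end over directed edges."""
    adjacency = {}
    for src, dst in edges:
        adjacency.setdefault(src, set()).add(dst)
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        if node == end:
            return True
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def essential_reactions(reactions, start, end, bidirectional=True, ignore=frozenset()):
    """Flag reactions whose removal disconnects start from end."""
    def edges_without(skip_id):
        return [
            edge
            for rxn_id, rxn in reactions.items()
            if rxn_id != skip_id
            for edge in reaction_edges(rxn, bidirectional, ignore)
        ]

    # Assumption: if no path exists to begin with, nothing is flagged.
    if not reachable(edges_without(None), start, end):
        return {rxn_id: False for rxn_id in reactions}
    return {
        rxn_id: not reachable(edges_without(rxn_id), start, end)
        for rxn_id in reactions
    }


reactions = {
    "R1": {"substrates": ["A", "ATP"], "products": ["B", "ADP"], "reversible": False},
    "R2": {"substrates": ["B"], "products": ["D"], "reversible": False},
    "R3": {"substrates": ["B"], "products": ["C"], "reversible": True},
    "R4": {"substrates": ["C"], "products": ["D"], "reversible": False},
}
flags = essential_reactions(reactions, "A", "D", ignore={"ATP", "ADP"})
```

Only R1 is flagged in this toy pathway: B can reach D either directly through R2 or through R3 and R4, so none of those three is individually required. Ignoring ATP and ADP keeps those currency metabolites from creating shortcuts between unrelated reactions.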
- sccellfie.stats.essential_genes_from_ablation(impact, metric='fraction_zeroed', threshold=1.0, topology=None, task_by_rxn=None, gpr_source=None, fallback_to_ablation_only=True)[source]
Derive per-task essential-gene lists from the ablation impact output, optionally filtered by a reaction-level topology essentiality DataFrame.
- Parameters:
impact (dict[str, pandas.DataFrame] or pandas.DataFrame) – Output from compute_gene_ablation_impact, or one of its DataFrames.
metric (str, optional (default: 'fraction_zeroed')) – Which impact DataFrame to threshold when impact is a dict. Must be one of ‘rel_change’, ‘abs_change’, ‘fraction_zeroed’.
threshold (float, optional (default: 1.0)) – A gene is flagged essential for task T iff impact[metric].loc[g, T] >= threshold.
topology (pandas.DataFrame, optional (default: None)) – (reactions x tasks) boolean DataFrame from compute_reaction_topology_essentiality. When provided, a gene is essential only if, in addition to clearing the threshold, at least one of the reactions it appears in (for that task) is marked essential by the topology.
task_by_rxn (pandas.DataFrame, optional) – Required when topology is provided. Defines each task’s reaction membership.
gpr_source (dict, optional) – Required when topology is provided. Same format as compute_gene_ablation_impact (GPR objects or strings). Used to map genes to reactions.
fallback_to_ablation_only (bool, optional (default: True)) – When topology is provided but a given task’s column is all-False (not evaluated, no endpoints, or missing in the model): if True, fall back to the plain ablation threshold for that task. If False, yield [] for that task.
- Returns:
{task_name: sorted list of essential genes}.
- Return type:
dict
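The thresholding and optional topology filter can be sketched with plain dictionaries. This is an illustrative sketch under assumptions: `impact` is a nested `{gene: {task: value}}` mapping standing in for one impact DataFrame, `gene_rxns` is a hypothetical precomputed gene-to-reaction mapping (the library derives it from `gpr_source` and `task_by_rxn`), and `topology` is `{reaction: {task: bool}}`.

```python
def essential_genes_sketch(impact, threshold=1.0, gene_rxns=None,
                           topology=None, fallback=True):
    """Return {task: sorted genes} whose impact clears the threshold."""
    tasks = sorted({task for row in impact.values() for task in row})
    result = {}
    for task in tasks:
        passing = [g for g, row in impact.items()
                   if row.get(task, 0.0) >= threshold]
        if topology is not None:
            essential_rxns = {r for r, col in topology.items()
                              if col.get(task, False)}
            if essential_rxns:
                # Keep genes that appear in at least one essential reaction.
                passing = [g for g in passing
                           if gene_rxns.get(g, set()) & essential_rxns]
            elif not fallback:
                # Topology column is all False and no fallback was requested.
                passing = []
        result[task] = sorted(passing)
    return result


impact = {
    "G1": {"T1": 1.0, "T2": 0.2},
    "G2": {"T1": 1.0, "T2": 1.0},
    "G3": {"T1": 0.5, "T2": 1.0},
}
gene_rxns = {"G1": {"R1"}, "G2": {"R2"}, "G3": {"R2"}}
topology = {
    "R1": {"T1": True, "T2": False},
    "R2": {"T1": False, "T2": False},
}

plain = essential_genes_sketch(impact)
filtered = essential_genes_sketch(impact, gene_rxns=gene_rxns, topology=topology)
strict = essential_genes_sketch(impact, gene_rxns=gene_rxns, topology=topology,
                                fallback=False)
```

Here T2 has an all-False topology column, so `filtered` falls back to the plain ablation result for T2 while `strict` yields an empty list for it.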
- sccellfie.stats.cohens_d(group1, group2)[source]
Calculates Cohen’s d effect size for two groups.
- Parameters:
group1 (array-like) – Values from the first group of samples.
group2 (array-like) – Values from the second group of samples.
- Returns:
d – Cohen’s d effect size.
- Return type:
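For reference, a common definition of Cohen's d divides the difference in means by the pooled standard deviation. The sketch below assumes that definition and a group1 minus group2 sign convention. Neither is confirmed by this page, so treat the function as illustrative rather than as the library's implementation.

```python
import math
from statistics import mean, variance


def cohens_d_sketch(group1, group2):
    """Difference in means divided by the pooled sample standard deviation."""
    n1, n2 = len(group1), len(group2)
    pooled_var = (
        (n1 - 1) * variance(group1) + (n2 - 1) * variance(group2)
    ) / (n1 + n2 - 2)
    return (mean(group1) - mean(group2)) / math.sqrt(pooled_var)


# Means are 4 and 3, both sample variances are 4, so the pooled SD is 2.
d = cohens_d_sketch([2, 4, 6], [1, 3, 5])
```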
- sccellfie.stats.scanpy_differential_analysis(adata, cell_type, cell_type_key, condition_key, condition_pairs=None, var_names=None, alpha=0.05, min_cells=30, downsample=False, n_iterations=50, agg_method='mean', random_state=None)[source]
Performs differential expression analysis using Scanpy’s rank_genes_groups function.
- Parameters:
adata (AnnData object) – Annotated data matrix containing the expression data.
cell_type (str or None) – The cell type to analyze. If None, analysis is performed for all cell types.
cell_type_key (str) – The column name in adata.obs containing the cell type information.
condition_key (str) – The column name in adata.obs containing the condition information.
condition_pairs (list of tuples, optional (default: None)) – The pairs of conditions to compare. If None, all pairs of conditions are compared.
var_names (list of str, optional (default: None)) – The list of variable names (e.g. genes) to perform the differential expression analysis on. If None, all genes are used.
alpha (float, optional (default: 0.05)) – The significance level for the multiple testing correction.
min_cells (int, optional (default: 30)) – Minimum number of cells required in each group for comparison.
downsample (bool, optional (default: False)) – Whether to perform downsampling to balance group sizes.
n_iterations (int, optional (default: 50)) – Number of subsampling iterations if downsample=True.
agg_method (str, optional (default: 'mean')) – Method to aggregate results across iterations (‘mean’ or ‘median’).
random_state (int, optional (default: None)) – Random seed for reproducibility of downsampling.
- Returns:
df_results – A DataFrame containing the results of the differential expression analysis, with columns:
cell_type: The analyzed cell type
feature: Name of the analyzed feature
group1: First condition in the comparison
group2: Second condition in the comparison
log2FC: Log2 fold change between means of conditions
test_statistic: Wilcoxon test statistic
p_value: Raw p-value
adj_p_value: BH-corrected p-value
cohens_d: Effect size (Cohen’s d)
n_group1: Number of observations in group1
n_group2: Number of observations in group2
median_group1: Median expression in group1
median_group2: Median expression in group2
median_diff: Difference in medians (group2 - group1)
- Return type:
pandas.DataFrame
- sccellfie.stats.pairwise_differential_analysis(adata, groupby, var_names=None, order=None, alternative='two-sided', alpha=0.05)[source]
Performs pairwise Wilcoxon tests for each feature between all group pairs. This function does not perform the test in a cell type-wise manner. For that, use scanpy_differential_analysis.
- Parameters:
adata (AnnData) – AnnData object containing the expression data.
groupby (str) – Column in adata.obs containing group labels.
var_names (list, optional (default: None)) – List of feature names to test. If None, all features are tested.
order (list, optional (default: None)) – Specific order of groups to test. If None, groups are sorted.
alternative (str, optional (default: 'two-sided')) – Alternative hypothesis for the Wilcoxon rank-sum test. Options are ‘two-sided’, ‘greater’, ‘less’.
alpha (float, optional (default: 0.05)) – Significance level for multiple testing correction.
- Returns:
df – A DataFrame containing the results with the same columns as scanpy_differential_analysis (except ‘cell_type’) for consistency:
feature: Name of the analyzed feature
group1: First condition in the comparison
group2: Second condition in the comparison
log2FC: Log2 fold change between conditions
test_statistic: Wilcoxon test statistic
p_value: Raw p-value
adj_p_value: BH-corrected p-value
cohens_d: Effect size (Cohen’s d)
n_group1: Number of observations in group1
n_group2: Number of observations in group2
median_group1: Median expression in group1
median_group2: Median expression in group2
median_diff: Difference in medians (group2 - group1)
- Return type:
pandas.DataFrame
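The adj_p_value column reported by both differential analysis functions is a Benjamini-Hochberg (BH) corrected p-value. As a rough sketch of that correction in pure Python (the function name is hypothetical, and the library may delegate this step to an existing statistics package):

```python
def bh_adjust(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


raw = [0.01, 0.04, 0.03, 0.005]
adj = bh_adjust(raw)
significant = [p <= 0.05 for p in adj]
```

Each raw p-value is scaled by m / rank, and the running minimum ensures that a smaller raw p-value never ends up with a larger adjusted value than a bigger one.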
- sccellfie.stats.generate_pseudobulks(adata, cell_type_key, n_pseudobulks=5, cells_per_bulk=1000, layer=None, use_raw=False, genes=None, agg_func='trimean', continuous_key=None, random_seed=None)[source]
Generates pseudo-bulk samples from single-cell data. Each pseudo-bulk represents a group of cells from the same cell type.
- Parameters:
adata (AnnData) – An AnnData object containing the single-cell expression data.
cell_type_key (str) – The key in adata.obs that contains the cell type annotations.
n_pseudobulks (int, optional (default: 5)) – The number of pseudo-bulk samples to generate for each cell type. Fewer are generated if the cell type has fewer than n_pseudobulks * cells_per_bulk cells.
cells_per_bulk (int, optional (default: 1000)) – The number of cells to include in each pseudo-bulk sample. Fewer are used if the cell type contains fewer cells.
layer (str, optional (default: None)) – The name of the layer in adata to use for aggregation. If None, the main expression matrix adata.X is used.
use_raw (bool, optional (default: False)) – Whether to use the data in adata.raw.X (True) or in adata.X (False).
genes (list, optional (default: None)) – List of gene names to include in the pseudo-bulk samples. If None, all genes are included.
agg_func (str, optional (default: 'trimean')) – The aggregation function to apply. Options are ‘mean’, ‘median’, ‘25p’ (25th percentile), ‘75p’ (75th percentile), ‘trimean’ (0.5*Q2 + 0.25(Q1+Q3)), and ‘topmean’ (computed among the top `top_percent`% of values).
continuous_key (str, optional (default: None)) – The key in adata.obs that contains continuous values to include in the pseudo-bulk samples. If None, continuous values are not included. This is useful for trajectory analysis or other continuous annotations.
random_seed (int, optional (default: None)) – Random seed for reproducible pseudo-bulk generation.
- Returns:
adata_pseudobulk – An AnnData object containing the pseudo-bulk samples. The expression values are aggregated across the cells in each pseudo-bulk. The obs DataFrame contains the cell type annotations and the continuous values if provided.
- Return type:
AnnData
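The default ‘trimean’ aggregation, 0.5*Q2 + 0.25(Q1+Q3), can be illustrated with the standard library. This is a sketch only: the quartile interpolation method used here ("inclusive") is an assumption and may differ from the one the library uses.

```python
from statistics import mean, quantiles


def trimean(values):
    """Tukey's trimean: 0.5 * Q2 + 0.25 * (Q1 + Q3)."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return 0.5 * q2 + 0.25 * (q1 + q3)


# One gene across five cells, with a single outlier cell.
expression = [1, 2, 3, 4, 100]
robust = trimean(expression)   # quartiles are 2, 3 and 4
plain_mean = mean(expression)  # pulled upward by the outlier
```

Because it relies only on quartiles, the trimean is much less sensitive than the mean to a few extreme cells within a pseudo-bulk.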
- sccellfie.stats.fit_gam_model(adata, cell_type_key, cell_type_order=None, continuous_key=None, genes=None, layer=None, use_raw=False, n_splines=10, spline_order=3, lam=0.6, normalize=False, use_pseudobulk=False, n_pseudobulks=5, cells_per_bulk=1000, pseudobulk_agg='trimean', **kwargs)[source]
Fits Generalized Additive Models (GAMs) to single-cell data for each gene.
- Parameters:
adata (AnnData) – An AnnData object containing the single-cell expression data.
cell_type_key (str) – The key in adata.obs that contains the cell type annotations.
cell_type_order (list, optional (default: None)) – The order in which to process cell types. If None, cell types are processed in alphabetical order. This is useful when you have a known biological order for the cell types.
continuous_key (str, optional (default: None)) – The key in adata.obs that contains continuous values to include in the GAM models. If None, continuous values are not included. This is useful for trajectory analysis or other continuous annotations.
genes (list, optional (default: None)) – List of gene names to include in the GAM models. If None, all genes are included.
layer (str, optional (default: None)) – The name of the layer in adata to use for aggregation. If None, the main expression matrix adata.X is used.
use_raw (bool, optional (default: False)) – Whether to use the data in adata.raw.X (True) or in adata.X (False).
n_splines (int, optional (default: 10)) – Number of splines to use for the feature function in the GAM. Must be non-negative.
spline_order (int, optional (default: 3)) – Order of spline to use for the feature function in the GAM. Must be non-negative.
lam (float, optional (default: 0.6)) – Strength of smoothing penalty in the GAM. Must be a positive float. Larger values enforce stronger smoothing.
normalize (bool, optional (default: False)) – Whether to normalize the expression values for each gene. Normalization uses min-max scaling, so that each gene’s minimum and maximum values become 0 and 1.
use_pseudobulk (bool, optional (default: False)) – Whether to use pseudobulk samples for the GAM analysis. If True, the GAM models are fitted to the aggregated expression values for each cell type. This is useful for reducing the bias in statistical power that results from having many single cells.
n_pseudobulks (int, optional (default: 5)) – The number of pseudo-bulk samples to generate for each cell type.
cells_per_bulk (int, optional (default: 1000)) – The number of cells to include in each pseudo-bulk sample.
pseudobulk_agg (str, optional (default: 'trimean')) – The aggregation function to apply when generating the pseudo-bulk samples. Options are ‘mean’, ‘median’, ‘25p’ (25th percentile), ‘75p’ (75th percentile), ‘trimean’ (0.5*Q2 + 0.25(Q1+Q3)), and ‘topmean’ (computed among the top `top_percent`% of values).
kwargs (dict, optional) – Additional keyword arguments to pass to the GAM model. You can find more about it in the pygam documentation: https://pygam.readthedocs.io/en/latest/api/gam.html.
- Returns:
result – A dictionary containing the fitted GAM models, the model scores, and additional information about the pseudo-bulk assignments and cell type encoder when applicable.
- Return type:
dict
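The min-max scaling applied when normalize=True can be sketched as follows. The helper name is hypothetical, and mapping a constant gene to all zeros is an assumption made here to avoid division by zero. The library may treat that case differently.

```python
def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant gene carries no signal, so return zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


scaled = min_max_scale([2.0, 4.0, 6.0])
```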
- sccellfie.stats.analyze_gam_results(gam_results, significance_threshold=0.05, fdr_level=0.05)[source]
Analyzes GAM model results with FDR correction using statsmodels.
- Parameters:
gam_results (dict) – A dictionary containing the results of the GAM analysis. It should contain the ‘scores’ key with a DataFrame of model scores for each gene.
significance_threshold (float, optional (default: 0.05)) – The significance threshold to consider a gene as significant.
fdr_level (float, optional (default: 0.05)) – The False Discovery Rate (FDR) level to correct for multiple testing.
- Returns:
results_df – A DataFrame containing the model scores for each gene, along with the adjusted p-values and significance based on the significance threshold and FDR level.
- Return type:
pandas.DataFrame
- sccellfie.stats.get_task_determinant_genes(adata, metabolic_task, task_by_rxn, groupby=None, group=None, min_activity=0.0)[source]
Finds the genes that determine the activity of all reactions in a metabolic task. Returns determinant genes for each reaction and their activity across specified cell groups, along with the fraction of cells in each group where the gene was determinant.
- Parameters:
adata (AnnData object) – Annotated data matrix.
metabolic_task (str) – Name of the metabolic task to analyze. Must be one of the tasks in the task_by_rxn DataFrame. It must also be present in the adata.metabolic_tasks attribute.
task_by_rxn (pandas.DataFrame) – A pandas.DataFrame object where rows are metabolic tasks and columns are reactions. Each cell contains ones or zeros, indicating whether a reaction is involved in a metabolic task.
groupby (str, optional (default: None)) – The key in the adata.obs DataFrame to group by. This could be any categorical annotation of cells (e.g., cell type, cluster).
group (str or list, optional (default: None)) – The group(s) in the adata.obs DataFrame to analyze. If None, the analysis is performed by treating all single cells as a group. If group is specified, groupby must also be specified, and the column referred to by groupby must contain the groups listed in group.
min_activity (float, optional (default: 0.0)) – Minimum reaction activity level to consider a reaction as active. Only genes that are associated with active reactions are considered. If zero, all reactions and therefore all genes are considered.
- Returns:
df – A pandas.DataFrame reporting the determinant genes for each reaction in the metabolic task. The DataFrame has the following columns:
Group: The cell group.
Rxn: The reaction.
Det-Gene: The determinant gene for the reaction.
RAL: The reaction activity level for the reaction.
Cell_fraction: The fraction of cells in the group where this gene was determinant.
- Return type:
pandas.DataFrame
Notes
This function assumes that reaction activity levels have been computed using sccellfie.reaction_activity.compute_reaction_activity() and are stored in adata.reactions.X.
Scores are computed as previously indicated in the CellFie paper (https://doi.org/10.1016/j.crmeth.2021.100040).
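The Cell_fraction column can be understood through a small sketch. It assumes that, for a single reaction, the determinant gene of each cell is already known. The per-cell labels and the helper name below are hypothetical, and the library derives these values from the data stored in adata.

```python
from collections import Counter, defaultdict


def determinant_fractions(groups, det_genes):
    """Per group, the fraction of cells in which each gene was determinant."""
    counts = defaultdict(Counter)
    for grp, gene in zip(groups, det_genes):
        counts[grp][gene] += 1
    return {
        grp: {gene: n / sum(c.values()) for gene, n in c.items()}
        for grp, c in counts.items()
    }


# One reaction, six cells: group label and determinant gene for each cell.
groups = ["T", "T", "T", "T", "B", "B"]
det_genes = ["G1", "G1", "G1", "G2", "G2", "G2"]
fractions = determinant_fractions(groups, det_genes)
```

In this toy example G1 is determinant in three of the four T cells, while G2 is determinant in every B cell.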