Tensor Factorization (multiple conditions)
This notebook shows how to find patterns of metabolic activities across multiple samples/conditions simultaneously. For this, we use a non-negative tensor factorization. Briefly, tensor factorization is an unsupervised approach that relies on data organized as tensors, in this case a 3D tensor where dimensions corresponds to samples/conditions, cell types, and metabolic tasks. If you are interested, we recommened reading this review.
To illustrate this application application, the dataset we are using here includes BALF samples from donors with varying severities of COVID-19 (Liao et al 2020).
This tutorial includes following steps:
Loading libraries
[1]:
import pandas as pd
import numpy as np
import scanpy as sc
import sccellfie
import cell2cell as c2c
from tqdm import tqdm
import matplotlib.pyplot as plt
import seaborn as sns
import marsilea.plotter as mp
import marsilea as ma
import glasbey
/home/jovyan/my-conda-envs/single_cell/lib/python3.10/site-packages/dask/dataframe/__init__.py:31: FutureWarning: The legacy Dask DataFrame implementation is deprecated and will be removed in a future version. Set the configuration option `dataframe.query-planning` to `True` or None to enable the new Dask Dataframe implementation and silence this warning.
warnings.warn(
2025-09-03 18:44:11.561269: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
E0000 00:00:1756925051.584280 1125 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1756925051.591379 1125 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-09-03 18:44:11.619524: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
[2]:
import mpl_fontkit as fk
fk.install("Lato")
fk.set_font("Lato")
Font name: `Lato`
Loading COVID-19 data
We start loading our single-cell data. In this case, adata.X already contains raw counts, which are the main inputs of scCellFie.
[3]:
adata = c2c.datasets.balf_covid('BALF-COVID19.h5ad')
Run scCellFie
Now we run scCellFie on the raw data.
[4]:
batch_key = 'sample' # Specify batch_key or leave as None
[5]:
results = sccellfie.run_sccellfie_pipeline(adata, # Raw counts
organism='human',
sccellfie_data_folder=None,
n_counts_col='n_counts', # Column where total counts per cells are stored in adata.obs
process_by_group=False, # Whether to do the processing by cell groups
groupby=None, # Column indicating cell groups if `process_by_group=True`
neighbors_key='neighbors', # Neighbors information if precomputed. Otherwise, it will be computed here
n_neighbors=10, # Number of neighbors to use
batch_key=batch_key, # None if batches are not included
threshold_key='sccellfie_threshold', # This is for using the default database. If personalized thresholds are used, specificy column name
smooth_cells=True, # Whether to perform gene expression smoothing before running the tool
alpha=0.33, # Importance of neighbors' expression for the smoothing (0 to 1)
chunk_size=5000, # Number of chunks to run the processing steps (helps with the memory)
disable_pbar=False,
save_folder=None, # In case results will be saved. If so, results will not be returned and should be loaded from the folder (see sccellfie.io.load_data function
save_filename=None # Name for saving the files, otherwise a default name will be used
)
==== scCellFie Pipeline: Initializing ====
Loading scCellFie database for organism: human
==== scCellFie Pipeline: Processing entire dataset ====
---- scCellFie Step: Preprocessing data ----
/home/jovyan/my-conda-envs/single_cell/lib/python3.10/site-packages/sccellfie/preprocessing/adata_utils.py:130: UserWarning: n_counts not found in adata.obs. Calculating total counts.
warnings.warn(f"{n_counts_key} not found in adata.obs. Calculating total counts.", UserWarning)
/home/jovyan/my-conda-envs/single_cell/lib/python3.10/site-packages/sccellfie/preprocessing/adata_utils.py:151: UserWarning: Normalizing data.
warnings.warn("Normalizing data.", UserWarning)
---- scCellFie Step: Computing neighbors ----
/home/jovyan/my-conda-envs/single_cell/lib/python3.10/site-packages/scanpy/preprocessing/_highly_variable_genes.py:73: UserWarning: `flavor='seurat_v3'` expects raw count data, but non-integers were found.
warnings.warn(
/home/jovyan/my-conda-envs/single_cell/lib/python3.10/site-packages/sccellfie/sccellfie_pipeline.py:414: FutureWarning: The specified parameters ('basis',) are no longer positional. Please specify them like `basis=True`
sc.external.pp.harmony_integrate(bdata, batch_key, verbose)
---- scCellFie Step: Preparing inputs ----
Gene names corrected to match database: 22
Shape of new adata object: (63103, 930)
Number of GPRs: 787
Shape of tasks by genes: (218, 930)
Shape of reactions by genes: (787, 930)
Shape of tasks by reactions: (218, 787)
---- scCellFie Step: Smoothing gene expression ----
Smoothing Expression: 100%|██████████| 13/13 [00:04<00:00, 3.08it/s]
/home/jovyan/my-conda-envs/single_cell/lib/python3.10/site-packages/sccellfie/expression/smoothing.py:188: ImplicitModificationWarning: Setting element `.layers['smoothed_X']` of view, initializing view as actual.
adata.layers[key_added] = smoothed_matrix
---- scCellFie Step: Computing gene scores ----
---- scCellFie Step: Computing reaction activity ----
Cell Rxn Activities: 100%|██████████| 63103/63103 [05:38<00:00, 186.58it/s]
---- scCellFie Step: Computing metabolic task activity ----
Removed 1 metabolic tasks with zeros across all cells.
==== scCellFie Pipeline: Processing completed successfully ====
Generate tensor
Once we have the results we can start generating the tensor for the factorization. However, we first need to preprocess the results.
Min-max normalization
In this case, we min-max-normalize the metabolic task activities across single cells:
[6]:
mt_original = results['adata'].metabolic_tasks.copy()
[7]:
results['adata'].metabolic_tasks.X = np.clip((mt_original.X - mt_original.X.min (axis=0))/(mt_original.X.max(axis=0) - mt_original.X.min (axis=0)),
a_min=0.,
a_max=1.0)
Tensor creation
Conveniently, scCellFie can be connected with cell2cell to handle tensors and perform the tensor factorization. This function below is designed to generate the inputs that cell2cell will use to generate a PreBuiltTensor object containing all the functions to perform our analysis:
[8]:
tensor_data = sccellfie.external.sccellfie_to_tensor(results,
sample_key='sample_new',
celltype_key='celltype',
score_type='metabolic_tasks',
min_cells_per_group=5,
agg_func='trimean',
use_raw=False,
order_labels=None,
sort_elements=True,
context_order=None,
fill_value=np.nan,
verbose=True
)
Using metabolic_tasks with 217 features and 63103 cells
Aggregating metabolic_task scores using 'trimean' method...
Processing contexts: 100%|██████████| 12/12 [00:06<00:00, 1.82it/s]
Building tensor with dimensions:
Contexts: 12
Cell Types: 10
Features: 217
Building tensor: 100%|██████████| 12/12 [00:00<00:00, 40.98it/s]
Tensor built successfully!
Shape: (12, 10, 217)
Non-zero elements: 19530
Fill value: nan
Next we use cell2cell to generate the tensor
[9]:
tensor = c2c.tensor.PreBuiltTensor(**tensor_data)
Tensor metadata
To plot the results later, we generate metadata useful for coloring tasks and samples/conditions by their major groups.
First, we generate groups for the metabolic tasks, based on the "System" column in the scCellFie database (results['task_info']):
[10]:
task_mapper = results['task_info'].set_index('Task')['System'].to_dict()
[11]:
met_functions = {task: task_mapper[task] for task in tensor.order_names[2]}
And for the samples, based on the disease condition (here, COVID-19 severity):
[12]:
def context_mapper(context):
if 'HC' in context: return 'Control'
elif 'M' in context: return 'Moderate'
elif 'S' in context: return 'Severe'
else: return 'NA'
[13]:
sample_groups = {context: context_mapper(context) for context in tensor.order_names[0]}
Then, create the metadata object:
[14]:
meta_tf = c2c.tensor.generate_tensor_metadata(interaction_tensor=tensor,
metadata_dicts=[sample_groups, None, met_functions],
fill_with_order_elements=True
)
Sort tensor elements by ordered categories
Then, based on this metadata, we can sort the tensor by the categories and their elements:
[15]:
meta_tf[2].sort_values(['Category', 'Element'], inplace=True)
[16]:
tensor = c2c.tensor.subset_tensor(tensor, subset_dict={2: meta_tf[2]['Element'].values.tolist()})
Tensor factorization
Number of factors
To select the number of factors for the tensor factorization, we could do it either manually or automatically from an Elbow analysis.
By selecting the parameter automatic_elbow=True we automatically select the number of factors.
[17]:
elbow, error = tensor.elbow_rank_selection(upper_rank=25,
runs=1, # This can be increased for more robust results
init='random',
automatic_elbow=True,
random_state=888,
)
100%|██████████| 25/25 [01:39<00:00, 3.96s/it]
The rank at the elbow is: 6
Running tensor factorization
Since we are relying on cell2cell, this is easily implemented. In this case, it performs, by default, a non-negative CANDECOMP/PARAFAC tensor factorization:
[18]:
tensor.compute_tensor_factorization(rank=tensor.rank,
init='svd',
random_state=888)
Main results of the factorization can be found in tensor.tl_object and tensor.factors. The latter is a dictionary containing the loadings for elements in each dimensions per factor, which are key to understand the identified patterns.
[19]:
tensor.factors.keys()
[19]:
odict_keys(['Contexts', 'Cell Types', 'Metabolic Tasks'])
Tensor factorization results
Again, cell2cell facilitates running the tensor factorization, but also visualizing results. We first start with bar plots for the loadings of each element across tensor dimensions:
Bar plots
[20]:
fig, ax = c2c.plotting.tensor_factors_plot(tensor, metadata=meta_tf, meta_cmaps=['inferno', 'tab10', 'tab20'], fontsize=14, filename='BarPlots-Factors-MetabolicTasks.pdf')
### Top metabolic tasks
We can also obtain the top tasks for each factor (sorted by their loadings):
[21]:
for i in range(1, tensor.rank+1):
print(tensor.get_top_factor_elements('Metabolic Tasks', factor_name=f'Factor {i}', top_number=5))
print('')
Methionine degradation 0.155069
Arachidonate synthesis 0.141260
Glutamine degradation 0.136161
Synthesis of palmitoyl-CoA 0.134918
Synthesis of malonyl-coa 0.132144
Name: Factor 1, dtype: float64
Link between glyoxylate metabolism and pentose phosphate pathway (Xylulose to glycolate) 0.185962
Synthesis of globoside (link with globoside metabolism) 0.153227
Synthesis of galactosyl glucosyl ceramide (link with ganglioside metabolism) 0.153225
Synthesis of glucocerebroside 0.140957
Phosphatidyl-inositol synthesis 0.136437
Name: Factor 2, dtype: float64
Conversion of glutamate to glutamine 0.230501
Phosphatidyl-serine synthesis 0.159930
Phosphatidyl-ethanolamine synthesis 0.159077
Phosphatidyl-choline synthesis 0.158241
Sphingomyelin synthesis 0.150424
Name: Factor 3, dtype: float64
Glutamine synthesis 0.151190
Asparagine synthesis 0.148869
Palmitate synthesis 0.139537
Hydroxymethylglutaryl-CoA synthesis 0.130992
Threonine degradation 0.130597
Name: Factor 4, dtype: float64
Phosphatidyl-serine synthesis 0.155484
Phosphatidyl-ethanolamine synthesis 0.154186
Phosphatidyl-choline synthesis 0.152877
Sphingomyelin synthesis 0.151348
Ceramide synthesis 0.148104
Name: Factor 5, dtype: float64
Synthesis of galactosyl glucosyl ceramide (link with ganglioside metabolism) 0.149196
Triacylglycerol synthesis 0.145805
Synthesis of glucocerebroside 0.145125
Synthesis of globoside (link with globoside metabolism) 0.142846
Phosphatidyl-choline synthesis 0.141422
Name: Factor 6, dtype: float64
Heatmaps
Then, we can select metabolic tasks with the higher loadings by setting thresholds:
[22]:
df = tensor.factors['Metabolic Tasks']
[23]:
_ = plt.hist(df.values.flatten(), bins = 20)
[24]:
loading_threshold = 0.125
With these thresholds, we can plot the high-loading metabolic tasks across factors. In addition we can visualize the key cells types to have an idea of what are the patterns capturing.
In this case, we use Marsilea for the visualizations:
[27]:
fontsize = 12
## CELL TYPES HEATMAP
# Data
df_plot = tensor.factors['Cell Types'].T
data = df_plot.values
vlimit = np.max(np.abs(data))
h1 = ma.Heatmap(data, linewidth=1, vmin=0, vmax=vlimit, height=2, width=2, cmap='YlOrBr',
cbar_kws={'title' : 'Cell Loadings', 'orientation' : 'horizontal'},
)
h1.add_left(mp.Labels(list(df_plot.index), fontsize=fontsize), pad=.05)
h1.add_bottom(mp.Labels(list(df_plot.columns), fontsize=fontsize, rotation=90), pad=.1)
h1.add_top(mp.Chunk([''], ['#24668C'], fontsize=fontsize-2, rotation=0, ha='center'), pad=0.05, name='Groups', legend=True)
h1.add_dendrogram("top", method='ward', metric='euclidean', colors=['#24668C'])
h1.add_title('Cell Types')
## METABOLIC TASKS HEATMAP
# Data
df_plot = df[(df.T > loading_threshold).any()].T
data = df_plot.values
vlimit = np.max(np.abs(data))
# Task grouping
task_categories = [task_mapper[t] for t in df_plot.columns]
task_cats = sorted(set(task_categories))
palette = glasbey.extend_palette('Dark2', palette_size=max(len(task_cats), 10))
h2 = ma.Heatmap(data, linewidth=1, vmin=0, vmax=vlimit, height=2, width=10, cmap='YlOrBr',
cbar_kws={'title' : 'Task Loadings', 'orientation' : 'horizontal'},
)
h2.group_cols(task_categories, order=task_cats)
#h2.add_left(mp.Labels(list(df_plot.index), fontsize=fontsize), pad=.05)
h2.add_top(mp.Chunk(len(task_cats)*[''], palette, fontsize=fontsize-2, rotation=0,
label='Metabolic Task Groups'), pad=0.01, name='Groups', legend=True)
h2.add_dendrogram("top", method='ward', metric='euclidean', colors=palette)
h2.add_bottom(mp.Labels(list(df_plot.columns), fontsize=fontsize, rotation=90), pad=.1)
h2.add_title('Metabolic Tasks')
h = (h1 + 0.2 + h2)
h.add_legends(stack_size=3, stack_by='col')
h.render()
h.save('Heatmap-Factors-MetabolicTasks.pdf', dpi=300)