Metabolic Markers

In this tutorial, we will walk you through how to identify metabolic task markers from single-cell data. We will use different approaches to detect important metabolic tasks that characterize distinct cell types in your dataset.

Here, we will use the results we previously generated for the Human Endometrial Cell Atlas (HECA) dataset (Mareckova & Garcia-Alonso et al 2023) by running our Quick Start Tutorial.

Loading libraries

[1]:

import sccellfie
import scanpy as sc

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
import glasbey

import textwrap

## To avoid warnings
import warnings
warnings.filterwarnings("ignore")

In addition, we set up a folder to save our figures. This folder is stored in the settings of Scanpy:

[2]:

sc.settings.figdir = './results/marker-figures/'

Loading endometrium results

We start opening the results previously generated by running the scCellFie pipeline on the HECA dataset. If you haven’t run the pipeline yet, please follow this tutorial.

In this case, we will load the objects that were present in results['adata'] in that tutorial. This object contains:

results['adata']: contains gene expression in .X.
results['adata'].layers['gene_scores']: contains gene scores as in the original CellFie paper.
results['adata'].uns['Rxn-Max-Genes']: contains determinant genes for each reaction per cell.
results['adata'].reactions: contains reaction scores in .X so every scanpy function can be used on this object to visualize or compare values.
results['adata'].metabolic_tasks: contains metabolic task scores in .X so every scanpy function can be used on this object to visualize or compare values.

Here, we will name this object directly as adata. Each of the previous elements should be under adata., as for example adata.metabolic_tasks.

[3]:

adata = sccellfie.io.load_adata(folder='./results/',
                                filename='Human_HECA_scCellFie'
                               )

./results//Human_HECA_scCellFie.h5ad was correctly loaded
./results//Human_HECA_scCellFie_reactions.h5ad was correctly loaded
./results//Human_HECA_scCellFie_metabolic_tasks.h5ad was correctly loaded

In this case, we are interested in the metabolic task scores, which can be found in the adata.metabolic_tasks AnnData object.

[4]:

adata.metabolic_tasks

[4]:

AnnData object with n_obs × n_vars = 90001 × 215
    obs: 'n_genes', 'sample', 'percent_mito', 'n_counts', 'Endometriosis_stage', 'Endometriosis', 'Hormonal treatment', 'Binary Stage', 'Stage', 'phase', 'dataset', 'Age', 'lineage', 'celltype', 'label_long'
    uns: 'Binary Stage_colors', 'Biopsy_type_colors', 'Endometrial_pathology_colors', 'Endometriosis_stage_colors', 'GarciaAlonso_celltype_colors', 'Group_colors', 'Hormonal treatment_colors', 'Library_genotype_colors', 'Mareckova_celltype_colors', 'Mareckova_epi_celltype_colors', 'Mareckova_lineage_colors', 'Processing_colors', 'Symbol_colors', 'Tan_cellsubtypes_colors', 'Tan_celltype_colors', 'Treatment_colors', 'celltype_colors', 'dataset_colors', 'genotype_colors', 'hvg', 'label_long_colors', 'leiden', 'leiden_R_colors', 'leiden_colors', 'lineage_colors', 'neighbors', 'normalization', 'phase_colors', 'umap'
    obsm: 'X_scVI', 'X_umap'
    obsp: 'connectivities', 'distances'

Defining cell types

For metabolic marker identification, we need to specify which column in the AnnData object contains the cell type information:

[5]:

# Our column indicating the cell types in the adata.obs dataframe
cell_group = 'celltype'

Which can be visualized as

[17]:

sc.pl.umap(adata, color=cell_group, frameon=False)

../_images/notebooks_metabolic_markers_16_0.png

Identification of metabolic markers

To identify markers, we can use different approaches. Here we show an approach based on TF-IDF, which comes from the Natural Language Processing field, and the logistic regression implemented in Scanpy.

Detection using TF-IDF

scCellFie implements a TF-IDF (Term Frequency-Inverse Document Frequency) approach, adapted from the SoupX tool in R.

We define express_cut=5*np.log(2) to determine which cells are expressing each metabolic task. This value (5*np.log(2)) indicates that all reactions in the task are over the threshold value of the determinant gene (or “active”). This condition is often rare in single-cell data due to data sparsity (rarely all reactions in a task are over their determinant gene’s threshold), but it is a good approach for identifying markers.

[7]:

mrks = sccellfie.external.quick_markers(adata.metabolic_tasks,
                                        cluster_key=cell_group,
                                        n_markers=20,
                                        express_cut=5*np.log(2))

[8]:

mrks.head()

[8]:

	gene	cluster	tf	idf	tf_idf	gene_frequency_outside_cluster	gene_frequency_global	second_best_tf	second_best_cluster	pval	qval
0	Conversion of carnosine to beta-alanine	SOX9_basalis	0.558824	1.588612	0.887754	0.203806	0.204209	0.874235	Luminal	3.835604e-15	0.000000e+00
1	Fructose degradation (to glucose-3-phosphate)	SOX9_basalis	0.764706	0.705568	0.539552	0.493521	0.493828	0.913798	preGlandular	1.904306e-08	0.000000e+00
2	3'-Phospho-5'-adenylyl sulfate synthesis	SOX9_basalis	0.127451	2.932873	0.373798	0.053160	0.053244	0.293322	preGlandular	3.009349e-03	0.000000e+00
3	Biosynthesis of Tn_antigen (Glycoprotein N-ace...	SOX9_basalis	0.088235	4.154814	0.366601	0.015606	0.015689	0.088235	SOX9_basalis	3.486177e-05	0.000000e+00
4	Conversion of 1-phosphatidyl-1D-myo-inositol 4...	SOX9_basalis	0.068627	2.717943	0.186526	0.066007	0.066010	0.192308	Lymphatic	5.137296e-01	6.318684e-181

By examining the distributions of TF-IDF scores, we can filter the results to identify more specific tasks:

[9]:

# Visualize TF-IDF score distribution
mrks['tf_idf'].hist()

[9]:

<Axes: >

../_images/notebooks_metabolic_markers_22_1.png

[10]:

# Scatter plot of TF vs IDF with TF-IDF as color
scatter = plt.scatter(mrks['tf'], mrks['idf'], c=mrks['tf_idf'], cmap='YlGnBu')
plt.colorbar(scatter)
plt.xlabel('TF')
plt.ylabel('IDF')

[10]:

Text(0, 0.5, 'IDF')

../_images/notebooks_metabolic_markers_23_1.png

scCellFie includes a filtering function that fits a hyperbolic curve, while the user can define some thresholds based on the tf-idf scores, and the ratio between the top and second best cell clusters (tf_ratio).

[11]:

x_col = 'tf'
y_col = 'idf'
df = mrks
tfidf_threshold = 0.3
tf_ratio = 1.2

# Visualization
plt.figure(figsize=(6, 6))
# Plot all points
plt.scatter(df[x_col], df[y_col], alpha=0.5, c='lightgray', label='All points')
# Plot selected points
filtered_mrks, curve = sccellfie.external.filter_tfidf_markers(df, tf_col=x_col, idf_col=y_col,
                                                              tfidf_threshold=tfidf_threshold,
                                                              tf_ratio=tf_ratio)
plt.scatter(filtered_mrks[x_col], filtered_mrks[y_col], c='red', label='Selected points')
plt.plot(*curve, label='Fit curve')
plt.xlabel('TF')
plt.ylabel('IDF')
plt.legend()

[11]:

<matplotlib.legend.Legend at 0x7fd0f89e8880>

../_images/notebooks_metabolic_markers_25_1.png

A detailed list of the selected markers:

[26]:

filtered_mrks

[26]:

	gene	cluster	tf	idf	tf_idf	gene_frequency_outside_cluster	gene_frequency_global	second_best_tf	second_best_cluster	pval	qval
3	Biosynthesis of Tn_antigen (Glycoprotein N-ace...	SOX9_basalis	0.088235	4.154814	0.366601	0.015606	0.015689	0.088235	SOX9_basalis	3.486177e-05	0.000000e+00
20	Presence of the thioredoxin system through the...	SOX9_functionalis_I	0.320175	4.499821	1.440732	0.009537	0.011111	0.130323	SOX9_luminal	2.436788e-169	2.436788e-169
41	Biosynthesis of Tn_antigen (Glycoprotein N-ace...	SOX9_functionalis_II	0.167774	4.154814	0.697070	0.009330	0.015689	0.088235	SOX9_basalis	0.000000e+00	0.000000e+00
63	Presence of the thioredoxin system through the...	SOX9_luminal	0.130323	4.499821	0.586428	0.010076	0.011111	0.130323	SOX9_luminal	2.169252e-74	2.436788e-169
67	Conversion of estrone (E1) to estrone sulfate ...	SOX9_luminal	0.073548	4.105754	0.301972	0.015982	0.016478	0.073548	SOX9_luminal	1.432926e-20	1.030451e-259
101	Glucose to lactate conversion	preGlandular	0.544278	2.296725	1.250056	0.083174	0.100588	0.544278	preGlandular	0.000000e+00	0.000000e+00
102	ATP generation from glucose (hypoxic condition...	preGlandular	0.605766	1.987948	1.204232	0.118577	0.136976	0.605766	preGlandular	0.000000e+00	0.000000e+00
103	AMP salvage from adenine	preGlandular	0.601942	1.932106	1.163015	0.126902	0.144843	0.338397	Cycling	0.000000e+00	0.000000e+00
104	3'-Phospho-5'-adenylyl sulfate synthesis	preGlandular	0.293322	2.932873	0.860275	0.043821	0.053244	0.293322	preGlandular	0.000000e+00	0.000000e+00
107	GMP salvage from guanine	preGlandular	0.103854	5.456934	0.566725	0.000358	0.004267	0.017722	Cycling	0.000000e+00	0.000000e+00
108	IMP salvage from hypoxanthine	preGlandular	0.103854	5.456934	0.566725	0.000358	0.004267	0.017722	Cycling	0.000000e+00	0.000000e+00
120	3'-Phospho-5'-adenylyl sulfate synthesis	Glandular	0.647357	2.932873	1.898616	0.027442	0.053244	0.293322	preGlandular	0.000000e+00	0.000000e+00
140	Synthesis of anthranilate from tryptophan	Glandular_secretory	0.729369	2.805307	2.046104	0.041598	0.060488	0.729369	Glandular_secretory	0.000000e+00	0.000000e+00
147	Synthesis of L-kynurenine from tryptophan	Glandular_secretory	0.090210	4.927531	0.444514	0.004901	0.007244	0.090210	Glandular_secretory	1.518326e-177	1.518326e-177
148	Phenylalanine to phenylacetate (via phenylacet...	Glandular_secretory	0.087783	4.792850	0.420732	0.006044	0.008289	0.051948	Glandular_secretory_FGF7	1.280152e-155	1.280152e-155
149	Synthesis of N-formylanthranilate from tryptophan	Glandular_secretory	0.078883	5.300553	0.418126	0.002902	0.004989	0.078883	Glandular_secretory	2.985949e-179	2.985949e-179
164	Synthesis of alanine from glutamine	Glandular_secretory_FGF7	0.307359	3.305898	1.016099	0.035970	0.036666	0.081967	preCiliated	7.765291e-45	4.231413e-58
165	Synthesis of L-kynurenine from tryptophan	Glandular_secretory_FGF7	0.194805	4.927531	0.959909	0.006762	0.007244	0.090210	Glandular_secretory	6.100311e-50	1.518326e-177
166	Synthesis of N-formylanthranilate from tryptophan	Glandular_secretory_FGF7	0.142857	5.300553	0.757222	0.004634	0.004989	0.078883	Glandular_secretory	1.392133e-37	2.985949e-179
180	Conversion of estrone (E1) to estrone sulfate ...	preLuminal	0.328662	4.105754	1.349407	0.013731	0.016478	0.073548	SOX9_luminal	1.030451e-259	1.030451e-259
182	Mevalonate synthesis	preLuminal	0.368153	2.801823	1.031499	0.057994	0.060699	0.368153	preLuminal	1.489213e-145	0.000000e+00
184	Synthesis of testosterone from androstenedione	preLuminal	0.161783	4.681343	0.757364	0.007925	0.009267	0.131543	Luminal	4.782356e-116	0.000000e+00
200	Mevalonate synthesis	Luminal	0.690347	2.801823	1.934229	0.039422	0.060699	0.368153	preLuminal	0.000000e+00	0.000000e+00
205	Synthesis of testosterone from androstenedione	Luminal	0.131543	4.681343	0.615799	0.005134	0.009267	0.131543	Luminal	0.000000e+00	0.000000e+00
260	Synthesis of thromboxane from arachidonate	MUC5B	0.215539	4.655306	1.003399	0.008594	0.009511	0.128403	Immune_Myeloid	1.175419e-88	0.000000e+00
341	Glucuronate synthesis (via inositol)	mPV	0.416452	2.666280	1.110379	0.068004	0.069510	0.416452	mPV	1.042929e-82	1.737183e-230
342	NAD synthesis from nicotinamide	mPV	0.138817	6.707096	0.931062	0.000625	0.001222	0.065491	ePV_1a	3.589895e-98	3.589895e-98
343	Synthesis of phosphatidylinositol from inositol	mPV	0.141388	3.966842	0.560865	0.018402	0.018933	0.141388	mPV	6.861606e-31	2.793213e-56
360	Glucuronate synthesis (via inositol)	ePV_1a	0.518892	2.666280	1.383510	0.067519	0.069510	0.416452	mPV	2.032582e-128	1.737183e-230
362	Ethanol to acetaldehyde	ePV_1a	0.141058	6.662644	0.939819	0.000658	0.001278	0.141058	ePV_1a	5.120176e-101	5.120176e-101
367	NAD synthesis from nicotinamide	ePV_1a	0.065491	6.707096	0.439256	0.000937	0.001222	0.065491	ePV_1a	2.118539e-37	3.589895e-98
600	Conversion of glutamate to glutamine	dHormones	0.714286	2.658002	1.898573	0.069737	0.070088	0.590965	Immune_Myeloid	9.165935e-30	0.000000e+00
601	Phenylalanine to phenylacetaldehyde	dHormones	0.673469	2.347129	1.580719	0.095329	0.095643	0.673469	dHormones	1.542389e-22	0.000000e+00
640	Conversion of glutamate to glutamine	Immune_Myeloid	0.590965	2.658002	1.570787	0.050686	0.070088	0.590965	Immune_Myeloid	0.000000e+00	0.000000e+00
643	Synthesis of bilirubin	Immune_Myeloid	0.141089	4.639083	0.654524	0.004771	0.009667	0.025974	Glandular_secretory_FGF7	0.000000e+00	0.000000e+00
644	Synthesis of thromboxane from arachidonate	Immune_Myeloid	0.128403	4.655306	0.597757	0.005082	0.009511	0.128403	Immune_Myeloid	0.000000e+00	0.000000e+00
645	Pyridoxal-phosphate synthesis	Immune_Myeloid	0.170792	3.473063	0.593172	0.025816	0.031022	0.170792	Immune_Myeloid	5.799610e-248	5.799610e-248
647	Hydroxymethylglutaryl-CoA synthesis	Immune_Myeloid	0.094988	4.220675	0.400912	0.011698	0.014689	0.094988	Immune_Myeloid	1.201930e-155	3.545350e-172
648	Gluconeogenesis from Lactate	Immune_Myeloid	0.059097	5.713844	0.337668	0.001222	0.003300	0.012987	Glandular_secretory_FGF7	6.341532e-198	6.341532e-198
649	Glutamine synthesis	Immune_Myeloid	0.055384	5.918638	0.327796	0.000726	0.002689	0.017316	Glandular_secretory_FGF7	2.344026e-203	2.344026e-203
681	Conversion of 1-phosphatidyl-1D-myo-inositol 4...	Arterial	0.260780	2.717943	0.708786	0.064951	0.066010	0.192308	Lymphatic	1.398377e-41	6.318684e-181
700	Conversion of 1-phosphatidyl-1D-myo-inositol 4...	Lymphatic	0.192308	2.717943	0.522681	0.065828	0.066010	0.192308	Lymphatic	1.244686e-06	6.318684e-181

[12]:

# Get unique list of filtered markers
tf_idf_mrks = filtered_mrks['gene'].unique().tolist()
len(tf_idf_mrks)

[12]:

Detection using Logistic Regression in Scanpy

We can also use logistic regression, as implemented in Scanpy, to identify metabolic task markers:

[20]:

method = 'logreg'
sc.tl.rank_genes_groups(adata.metabolic_tasks, cell_group, method=method,
                       use_raw=False, key_added=method)
sc.pl.rank_genes_groups(adata.metabolic_tasks, n_genes=20, sharey=True, key=method, fontsize=4)

../_images/notebooks_metabolic_markers_30_0.png

[21]:

# Get the results as a dataframe
scanpy_df = sc.get.rank_genes_groups_df(adata.metabolic_tasks,
                                      key=method,
                                      group=None)

[22]:

scanpy_df.head()

[22]:

	group	names	scores
0	SOX9_basalis	3'-Phospho-5'-adenylyl sulfate synthesis	1.646670
1	SOX9_basalis	Biosynthesis of Tn_antigen (Glycoprotein N-ace...	1.050843
2	SOX9_basalis	Conversion of carnosine to beta-alanine	0.983181
3	SOX9_basalis	GDP-mannose synthesis	0.850716
4	SOX9_basalis	Synthesis of fructose-6-phosphate from erythro...	0.804910

We can visualize these markers using dot plots:

[27]:

sc.pl.rank_genes_groups_dotplot(adata.metabolic_tasks, n_genes=5, groupby=cell_group,
                               key=method, use_raw=False, standard_scale='var',
                              )

WARNING: dendrogram data not found (using key=dendrogram_celltype). Running `sc.tl.dendrogram` with default parameters. For fine tuning it is recommended to run `sc.tl.dendrogram` independently.

../_images/notebooks_metabolic_markers_34_1.png

We can filter the logistic regression markers based on their scores:

[28]:

# Visualize score distribution
scanpy_df['scores'].hist()

[28]:

<Axes: >

../_images/notebooks_metabolic_markers_36_1.png

[31]:

# Filter based on score threshold
scores_threshold = 1.0
filtered_scanpy_mrks = scanpy_df.loc[scanpy_df['scores'] > scores_threshold]

[30]:

filtered_scanpy_mrks

[30]:

	group	names	scores
0	SOX9_basalis	3'-Phospho-5'-adenylyl sulfate synthesis	1.646670
1	SOX9_basalis	Biosynthesis of Tn_antigen (Glycoprotein N-ace...	1.050843
215	SOX9_functionalis_I	3'-Phospho-5'-adenylyl sulfate synthesis	1.249713
430	SOX9_functionalis_II	Synthesis of quinolinate from tryptophan	1.505565
431	SOX9_functionalis_II	3'-Phospho-5'-adenylyl sulfate synthesis	1.135055
...	...	...	...
7101	Venous	Conversion of lysine to L-Saccharopine	1.190897
7102	Venous	Synthesis of galactosyl glucosyl ceramide (lin...	1.069577
7103	Venous	Degradation of uracil	1.018579
7104	Venous	Degradation of cytosine	1.018579
7310	Arterial	Synthesis of nitric oxide from arginine	2.254074

104 rows × 3 columns

[32]:

# Convert them to a list
scanpy_mrks = filtered_scanpy_mrks['names'].unique().tolist()
len(scanpy_mrks)

[32]:

Visualizations

We can combine the metabolic tasks from different approaches:

[33]:

both_markers = sorted(set(tf_idf_mrks + scanpy_mrks))
len(both_markers)

[33]:

Select only highly active markers

We can filter the markers based on their activity level by aggregating the metabolic activity to the cell-type level:

[34]:

# Aggregate the metabolic activity of markers into a cell-type level using trimean
agg = sccellfie.expression.aggregation.agg_expression_cells(adata.metabolic_tasks[:, both_markers],
                                                            groupby=cell_group,
                                                            agg_func='trimean')

Filter markers with an aggregated score of at least 1 in at least one cell type (you can adjust this threshold depending on how strict you want to be).

[35]:

both_markers = agg.T.loc[(agg.T > 1.).any(axis=1)].index.tolist()
len(both_markers)

[35]:

UMAP

Now we can visualize these metabolic task markers using UMAP:

[39]:

plt.rcParams['figure.figsize'] = (3,3)
plt.rcParams['font.size'] = 10

sc.pl.embedding(adata.metabolic_tasks,
               color=both_markers,
               ncols=4,
               palette=palette,
               frameon=False,
               basis='X_umap',
               wspace=0.7,
               title=["\n".join(textwrap.wrap(t, width=45)) for t in both_markers],
               cmap='OrRd'
              )

../_images/notebooks_metabolic_markers_48_0.png

Violin plot

We can further inspect the metabolic score distribution across single cells per cell type for these markers:

[40]:

fig, axes = sccellfie.plotting.create_multi_violin_plots(adata.metabolic_tasks,
                                                       features=both_markers,
                                                       groupby=cell_group,
                                                       stripplot=False,
                                                       n_cols=4,
                                                       ylabel='Metabolic Score'
                                                      )

../_images/notebooks_metabolic_markers_50_0.png

Heatmap

As a summary visualization, we can create a heatmap of the identified metabolic task markers across cell types based on the aggregated scores into a cell-type level (found in agg calculated before):

[42]:

filtered_agg = agg[both_markers]

[43]:

filtered_agg.head()

[43]:

Task	3'-Phospho-5'-adenylyl sulfate synthesis	AMP salvage from adenine	ATP generation from glucose (hypoxic conditions) - glycolysis	Biosynthesis of Tn_antigen (Glycoprotein N-acetyl-D-galactosamine)	Biosynthesis of core2 (beta-D-Galactosyl-1,3-(N-acetyl-beta-D-glucosaminyl-1,6)-N-acetyl-D-galactosaminyl-R)	Biosynthesis of core4 (N-Acetyl-beta-D-glucosaminyl-1,6-(N-acetyl-beta-D-glucosaminyl-1,3)-N-acetyl-D-galactosaminyl-R)	CMP-N-acetylneuraminate synthesis	Calnexin/calreticulin cycle	Conversion of 1-phosphatidyl-1D-myo-inositol 4,5-bisphosphate to 1D-myo-inositol 1,4,5-trisphosphate	Conversion of estrone (E1) to estrone sulfate (E1S)	...	Synthesis of anthranilate from tryptophan	Synthesis of bilirubin	Synthesis of fructose-6-phosphate from erythrose-4-phosphate (HMP shunt)	Synthesis of galactosyl glucosyl ceramide (link with ganglioside metabolism)	Synthesis of phosphatidylinositol from inositol	Synthesis of quinolinate from tryptophan	Synthesis of ribose-5-phosphate	Synthesis of testosterone from androstenedione	Synthesis of thromboxane from arachidonate	Tyrosine to acetoacetate and fumarate
Arterial	0.256560	1.675679	2.481569	0.683852	0.442055	0.344145	0.754417	0.773945	2.522644	0.104100	...	0.004131	0.796489	0.466867	0.699921	1.124336	0.031946	0.551582	1.033685	0.045549	0.030364
Ciliated	1.078283	1.451005	1.929947	1.445988	0.892795	0.846883	0.723723	0.565869	1.477182	1.089631	...	0.704579	1.092265	0.682542	0.758055	0.308139	0.291273	0.635575	0.865189	0.193303	0.553913
Cycling	2.098761	3.000086	3.718694	1.807871	1.074159	0.971235	1.017814	1.237260	1.803018	0.326987	...	0.273492	1.119333	1.211169	1.053797	0.542966	0.374733	0.957516	0.769935	0.106314	0.666761
Fibroblast_basalis	0.240646	1.469132	2.305450	0.710848	0.304962	0.355424	0.744974	0.695407	1.322372	0.140297	...	0.000000	0.951126	0.491220	0.580029	1.478609	0.081673	0.513473	1.045122	0.000000	0.026769
Glandular	3.829955	1.562031	2.227763	1.779426	1.111685	1.377226	0.908321	0.626390	1.595914	0.025143	...	2.093063	1.597565	0.794605	0.812703	0.527581	0.565066	0.695606	0.675938	0.201555	0.747315

5 rows × 46 columns

We then apply min-max normalization for better visualization:

[46]:

normalized_markers = sccellfie.preprocessing.min_max_normalization(filtered_agg, axis=0)

Followed by the heatmap

[50]:

# Create heatmap
plt.figure(figsize=(16, 9))
sns.heatmap(normalized_markers, cmap='YlOrBr', annot=False, linewidths=0.5,
            cbar_kws={'label': 'Min-max scaled metabolic score'})

plt.title('Metabolic Markers across Cell Types')
plt.ylabel('Cell Type')
plt.xlabel('Metabolic Task')
plt.tight_layout()

../_images/notebooks_metabolic_markers_57_0.png