Spatial Overview

In this tutorial, we will use a processed 10X visium dataset of a Mouse Embryo (CytAssist FFPE). Coupled with H&E image, the visium section captures the transcriptome of a whole mouse embryo, enabling identification of regional metabolic processes across organs at early development.

Loading libraries

[1]:

import sccellfie

import scanpy as sc
import squidpy as sq

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

## To avoid warnings
import warnings
warnings.filterwarnings("ignore")

/home/jovyan/my-conda-envs/single_cell/lib/python3.10/site-packages/dask/dataframe/__init__.py:31: FutureWarning: The legacy Dask DataFrame implementation is deprecated and will be removed in a future version. Set the configuration option `dataframe.query-planning` to `True` or None to enable the new Dask Dataframe implementation and silence this warning.
  warnings.warn(

Define parameters

[2]:

SPOT_SIZE = 1.3
DPI = 150
ANNOTATION = 'leiden'

Loading data

The 10X Visium Cytassist dataset contains 6434 spots (n_obs) with 19288 genes (n_vars).

This is a processed dataset created by the OpenProblems community. Coupled with H&E image, the visium section captures the transcriptome of a whole mouse embryo, enabling identification of regional metabolic processes.

[3]:

adata = sc.read(filename='./data/MusMusculus_WholeEmbryo.h5ad',
                backup_url='https://zenodo.org/records/15330688/files/MusMusculus_WholeEmbryo.h5ad')

[4]:

sc.pl.spatial(adata,
              color=[None, 'total_counts'],
              size=SPOT_SIZE,
              cmap='YlGnBu',
              img_key="hires",
             )

../_images/notebooks_mouse_embryo_spatial_10_0.png

Preprocessing spatial data

Gene names

Although gene names in the form of gene symbols are preferred, scCellFie supports the ENSEMBL IDs as used in .var_names of this AnnData object. scCellFie can internally convert ENSEMBL IDs into gene symbols. In this case the gene symbols can be found under the column name of feature_name.

[5]:

adata.var.head()

[5]:

	feature_id	feature_types	genome	n_cells_by_counts	mean_counts	log1p_mean_counts	pct_dropout_by_counts	total_counts	log1p_total_counts	n_cells	feature_name
gene_ids
ENSMUSG00000051951	ENSMUSG00000051951	Gene Expression	mm10	742	0.147187	0.137313	88.467516	947.0	6.854354	742	Xkr4
ENSMUSG00000025900	ENSMUSG00000025900	Gene Expression	mm10	440	0.073205	0.070649	93.161330	471.0	6.156979	440	Rp1
ENSMUSG00000025902	ENSMUSG00000025902	Gene Expression	mm10	2614	0.852813	0.616705	59.372086	5487.0	8.610319	2614	Sox17
ENSMUSG00000025903	ENSMUSG00000025903	Gene Expression	mm10	5205	4.515853	1.707626	19.101647	29055.0	10.276980	5205	Lypla1
ENSMUSG00000033813	ENSMUSG00000033813	Gene Expression	mm10	5652	6.649984	2.034703	12.154181	42786.0	10.663990	5652	Tcea1

(Optional) To work with gene symbols, we will replace the index names of adata.var with the column feature_name.

[6]:

adata.var.set_index('feature_name', inplace=True)
adata.var.head()

[6]:

	feature_id	feature_types	genome	n_cells_by_counts	mean_counts	log1p_mean_counts	pct_dropout_by_counts	total_counts	log1p_total_counts	n_cells
feature_name
Xkr4	ENSMUSG00000051951	Gene Expression	mm10	742	0.147187	0.137313	88.467516	947.0	6.854354	742
Rp1	ENSMUSG00000025900	Gene Expression	mm10	440	0.073205	0.070649	93.161330	471.0	6.156979	440
Sox17	ENSMUSG00000025902	Gene Expression	mm10	2614	0.852813	0.616705	59.372086	5487.0	8.610319	2614
Lypla1	ENSMUSG00000025903	Gene Expression	mm10	5205	4.515853	1.707626	19.101647	29055.0	10.276980	5205
Tcea1	ENSMUSG00000033813	Gene Expression	mm10	5652	6.649984	2.034703	12.154181	42786.0	10.663990	5652

Raw counts

Raw counts are the inputs that scCellFie uses to internally perform all data processing that is needed.

We start preparing .X to contain the raw count matrix:

[7]:

adata.X = adata.layers['counts'].copy()

Run scCellFie

In current version, scCellFie supports human or mouse specific metabolic pathways. We will select organism='mouse' for the analysis.

[8]:

results = sccellfie.run_sccellfie_pipeline(adata,
                                           organism='mouse',
                                           sccellfie_data_folder=None,
                                           n_counts_col='total_counts',
                                           process_by_group=False,
                                           groupby=None,
                                           neighbors_key='neighbors',
                                           n_neighbors=10,
                                           batch_key=None,
                                           threshold_key='sccellfie_threshold',
                                           smooth_cells=True,
                                           alpha=0.33,
                                           chunk_size=5000,
                                           disable_pbar=False,
                                           save_folder=None,
                                           save_filename=None
                                          )


==== scCellFie Pipeline: Initializing ====
Loading scCellFie database for organism: mouse

==== scCellFie Pipeline: Processing entire dataset ====

---- scCellFie Step: Preprocessing data ----

---- scCellFie Step: Computing neighbors ----

2025-05-10 12:43:36.638794: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
E0000 00:00:1746881016.659355   13629 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1746881016.665681   13629 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-05-10 12:43:36.697329: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


---- scCellFie Step: Preparing inputs ----
Gene names corrected to match database: 9
Shape of new adata object: (6434, 756)
Number of GPRs: 676
Shape of tasks by genes: (200, 756)
Shape of reactions by genes: (676, 756)
Shape of tasks by reactions: (200, 676)

---- scCellFie Step: Smoothing gene expression ----

Smoothing Expression: 100%|██████████| 1/1 [00:00<00:00,  1.15it/s]


---- scCellFie Step: Computing gene scores ----

---- scCellFie Step: Computing reaction activity ----

Cell Rxn Activities: 100%|██████████| 6434/6434 [00:53<00:00, 121.26it/s]


---- scCellFie Step: Computing metabolic task activity ----
Removed 1 metabolic tasks with zeros across all cells.

==== scCellFie Pipeline: Processing completed successfully ====

Export results

[9]:

results.keys()

[9]:

dict_keys(['adata', 'gpr_rules', 'task_by_gene', 'rxn_by_gene', 'task_by_rxn', 'rxn_info', 'task_info', 'thresholds', 'organism'])

To access metabolic activities, we need to inspect results['adata']:

The processed spatial data is located in the AnnData object results['adata'].
The reaction activities for each spot are located in the AnnData object results['adata'].reactions.
The metabolic task activities for each spot are located in the AnnData object results['adata'].metabolic_tasks.

In particular:

results['adata']: contains gene expression in .X.
results['adata'].layers['gene_scores']: contains gene scores as in the original CellFie paper.
results['adata'].uns['Rxn-Max-Genes']: contains determinant genes for each reaction per spot.
results['adata'].reactions: contains reaction scores in .X so every scanpy function can be used on this object to visualize or compare values.
results['adata'].metabolic_tasks: contains metabolic task scores in .X so every scanpy function can be used on this object to visualize or compare values.

Other keys in the results dictionary are associated with the scCellFie database and are already filtered for the elements present in the dataset ('gpr_rules', 'task_by_gene', 'rxn_by_gene', 'task_by_rxn', 'rxn_info', 'task_info', 'thresholds', 'organism').

Save spatial results

We can save our spatial results contained in the AnnData objects (results['adata']) into a specific folder. Here we only need to specify the output folder, and the basename that our AnnData objects will have. This function below saves the expression object (results['adata']) and those containing scores for reactions and metabolic tasks (results['adata'].reactions and results['adata'].metabolic_tasks, respectively), as separate files.

[10]:

sccellfie.io.save_adata(adata=results['adata'], output_directory='./results', filename='Mouse_Spatial_scCellFie')

./results/Mouse_Spatial_scCellFie.h5ad was correctly saved
./results/Mouse_Spatial_scCellFie_reactions.h5ad was correctly saved
./results/Mouse_Spatial_scCellFie_metabolic_tasks.h5ad was correctly saved

Spatial visualization

scCellFie implements multiple visualizations, including plots to see metabolic activity across tissues.

In this case, we first define metabolic tasks that may be important for early stages of development:

[11]:

development = [
    'ATP generation from glucose (hypoxic conditions) - glycolysis',
    'Conversion of histidine to histamine',
    'O-linked glycosylation',
    'CMP-N-acetylneuraminate synthesis',
    'Conversion of estrone (E1) to estradiol-17beta (E2)',
    'Synthesis of inositol',
    'Conversion of 1-phosphatidyl-1D-myo-inositol 4,5-bisphosphate to 1D-myo-inositol 1,4,5-trisphosphate',
    'Synthesis of GABA',
    'Reactive oxygen species detoxification (H2O2 to H2O)',
    'S-adenosyl-L-methionine synthesis'
]

Then, we use a scCellFie plotting function to visualize them in space:

[12]:

fig, axes = sccellfie.plotting.plot_spatial(results['adata'].metabolic_tasks,
                                            keys=development ,
                                            img_key='hires',
                                            cmap='YlGnBu',
                                            size=SPOT_SIZE,
                                            use_raw=False,
                                            ncols=5,
                                            hspace=0.3,
                                            vmin=0
                                            )

../_images/notebooks_mouse_embryo_spatial_30_0.png

Identification of metabolic markers

To identify markers, we can use different approaches. Here we show an approach based on TF-IDF, which comes from the Natural Language Processing field, and the logistic regression implemented in Scanpy.

Defining annotated regions for the selection of markers

To select markers, we need to define region annotations so the algorithms can focus on enriched activities in those regions.

As a demonstration, we will perform simple unsupervised clustering to define regional features of the whole embryo.

[13]:

sadata = adata.copy()
sc.pp.normalize_total(sadata, target_sum=1e4)
sc.pp.log1p(sadata)
sc.pp.highly_variable_genes(sadata, min_mean=0.0125, max_mean=3, min_disp=0.5)
sadata = sadata[:, sadata.var.highly_variable]
sc.tl.pca(sadata, svd_solver='arpack')
sc.pp.neighbors(sadata, n_neighbors=10, n_pcs=40)
sc.tl.umap(sadata)
sc.tl.leiden(sadata, resolution=0.4)

Then, we assign these annotations to our AnnData objects in under results['adata'] to help with further analyses and visualizations

[14]:

results['adata'].obs[ANNOTATION] = sadata.obs['leiden']
results['adata'].reactions.obs[ANNOTATION] = sadata.obs['leiden']
results['adata'].metabolic_tasks.obs[ANNOTATION] = sadata.obs['leiden']

[15]:

fig, axes = sccellfie.plotting.plot_spatial(results['adata'],
                                            keys=[None, ANNOTATION],
                                            img_key='hires',
                                            cmap='YlGnBu',
                                            size=SPOT_SIZE ,
                                            use_raw=False,
                                            ncols=2,
                                            dpi=DPI
                                            )

../_images/notebooks_mouse_embryo_spatial_37_0.png

Marker detection using TF-IDF

scCellFie implements a TF-IDF (Term Frequency-Inverse Document Frequency) approach, adapted from the SoupX tool in R.

We define express_cut=5*np.log(2) to determine which spots are expressing each metabolic task. This value (5*np.log(2)) indicates that all reactions in the task are over the threshold value of the determinant gene (or “active”).

[16]:

mrks = sccellfie.external.quick_markers(results['adata'].metabolic_tasks,
                                        cluster_key=ANNOTATION,
                                        n_markers=20,
                                        express_cut=5*np.log(2))

[17]:

mrks.head()

[17]:

	gene	tf	idf	tf_idf	gene_frequency_outside_cluster	gene_frequency_global	second_best_tf	second_best_cluster
0	Glycogen degradation	0.905195	1.592333	1.441372	0.108051	0.203450	0.942529	15
1	Hydroxymethylglutaryl-CoA synthesis	0.802597	1.750950	1.405308	0.088100	0.173609	0.919540	15
2	Acetoacetate synthesis	0.709091	1.884865	1.336541	0.076095	0.151850	0.850575	15
3	Synthesis of acetone	0.709091	1.884865	1.336541	0.076095	0.151850	0.850575	15
4	Synthesis of creatine from arginine	0.990909	1.241558	1.230271	0.193503	0.288934	0.994253	15

By examining the distributions of TF-IDF scores, we can filter the results to identify more specific tasks:

[18]:

mrks['tf_idf'].hist()

[18]:

<Axes: >

../_images/notebooks_mouse_embryo_spatial_42_1.png

[19]:

# Scatter plot of TF vs IDF with TF-IDF as color
scatter = plt.scatter(mrks['tf'], mrks['idf'], c=mrks['tf_idf'], cmap='YlGnBu')
plt.colorbar(scatter)
plt.xlabel('TF')
plt.ylabel('IDF')

[19]:

Text(0, 0.5, 'IDF')

../_images/notebooks_mouse_embryo_spatial_43_1.png

scCellFie includes a filtering function that fits a hyperbolic curve, while the user can define some thresholds based on the tf-idf scores, and the ratio between the top and second best clusters (tf_ratio).

[20]:

x_col = 'tf'
y_col = 'idf'
df = mrks
tfidf_threshold = 0.3
tf_ratio = 1.2

# Visualization
plt.figure(figsize=(6, 6))
# Plot all points
plt.scatter(df[x_col], df[y_col], alpha=0.5, c='lightgray', label='All points')
# Plot selected points
filtered_mrks, curve = sccellfie.external.filter_tfidf_markers(df, tf_col=x_col, idf_col=y_col,
                                                              tfidf_threshold=tfidf_threshold,
                                                              tf_ratio=tf_ratio)
plt.scatter(filtered_mrks[x_col], filtered_mrks[y_col], c='red', label='Selected points')
plt.plot(*curve, label='Fit curve')
plt.xlabel('TF')
plt.ylabel('IDF')
plt.legend()

[20]:

<matplotlib.legend.Legend at 0x7f5c746ef130>

../_images/notebooks_mouse_embryo_spatial_45_1.png

A detailed list of the selected markers:

[21]:

filtered_mrks

[21]:

	gene	cluster	tf	idf	tf_idf	gene_frequency_outside_cluster	gene_frequency_global	second_best_tf	second_best_cluster	pval	qval
21	Ethanol to acetaldehyde	1	0.130841	4.184384	0.547490	0.004747	0.015232	0.058824	22	1.009938e-54	1.009938e-54
24	UDP-galactose synthesis	1	0.057944	5.214004	0.302120	0.000678	0.005440	0.057944	1	5.544113e-30	5.544113e-30
41	Glycerol-3-phosphate synthesis	2	0.107356	2.795542	0.300118	0.057157	0.061082	0.107356	2	2.263329e-05	0.000000e+00
62	S-adenosyl-L-methionine synthesis	3	0.301762	3.604566	1.087721	0.006355	0.027199	0.085324	11	5.039044e-130	5.039044e-130
63	Homocysteine synthesis (need methionine)	3	0.136564	4.478892	0.611655	0.001839	0.011346	0.037543	11	1.246696e-61	1.246696e-61
80	Degradation of cytosine	4	0.938497	2.583143	2.424271	0.012344	0.075536	0.430380	18	0.000000e+00	0.000000e+00
81	Degradation of uracil	4	0.938497	2.583143	2.424271	0.012344	0.075536	0.430380	18	0.000000e+00	0.000000e+00
100	Glycerol-3-phosphate synthesis	5	0.789731	2.795542	2.207726	0.011618	0.061082	0.107356	2	0.000000e+00	0.000000e+00
101	Krebs cycle - NADH generation	5	0.672372	3.099471	2.083996	0.002490	0.045073	0.031250	21	0.000000e+00	0.000000e+00
102	Gluconeogenesis from Glycerol	5	0.415648	3.559866	1.479651	0.002158	0.028443	0.025845	2	6.170726e-202	6.170726e-202
104	Synthesis of fructose-6-phosphate from erythro...	5	0.305623	3.522328	1.076506	0.010788	0.029531	0.139241	18	1.620833e-108	1.620833e-108
105	Glycine synthesis	5	0.259169	4.033153	1.045267	0.001328	0.017718	0.015905	2	3.962977e-122	3.962977e-122
106	Heme synthesis	5	0.232274	4.215475	0.979145	0.000000	0.014765	0.000000	22	2.854243e-119	2.854243e-119
109	Methionine degradation	5	0.146699	4.675007	0.685820	0.000000	0.009325	0.000000	22	2.170227e-74	2.170227e-74
110	Serine synthesis	5	0.146699	4.492686	0.659074	0.001992	0.011191	0.018809	7	1.714940e-61	1.714940e-61
120	Conversion of histidine to histamine	6	0.663043	2.409778	1.597787	0.055061	0.089835	0.529412	22	1.362200e-182	1.362200e-182
149	Gluconeogenesis from Lactate	7	0.216301	3.228088	0.698239	0.030417	0.039633	0.216301	7	7.201257e-34	1.629216e-44
160	Synthesis of taurine from cysteine	8	0.254839	2.430758	0.619451	0.079523	0.087970	0.254839	8	2.697996e-19	7.492150e-138
180	Synthesis of taurine from cysteine	9	0.631068	2.430758	1.533973	0.060571	0.087970	0.254839	8	7.492150e-138	7.492150e-138
200	UDP-N-acetyl D-galactosamine synthesis	10	0.078689	5.055780	0.397832	0.002774	0.006372	0.042071	9	4.986753e-22	4.986753e-22
220	Synthesis of bilirubin	11	0.986348	2.110058	2.081252	0.079954	0.121231	0.986348	11	4.338165e-281	0.000000e+00
221	Conversion of asparate to asparagine	11	0.907850	2.134718	1.938004	0.080606	0.118278	0.907850	11	7.658890e-231	0.000000e+00
222	S-adenosyl-L-methionine synthesis	11	0.085324	3.604566	0.307557	0.024426	0.027199	0.085324	11	2.733811e-07	5.039044e-130
306	(R)-3-Hydroxybutanoate synthesis	15	0.505747	2.311013	1.168788	0.087859	0.099161	0.505747	15	9.802464e-44	3.015513e-201
342	Conversion of estrone (E1) to estradiol-17beta...	17	0.057971	6.572127	0.380993	0.000159	0.001399	0.003226	8	3.233538e-13	3.233538e-13
343	Synthesis of testosterone from androstenedione	17	0.057971	6.466767	0.374885	0.000318	0.001554	0.057971	17	1.587698e-12	1.587698e-12
360	Degradation of cytosine	18	0.430380	2.583143	1.111732	0.071125	0.075536	0.430380	18	2.476438e-18	0.000000e+00
361	Degradation of uracil	18	0.430380	2.583143	1.111732	0.071125	0.075536	0.430380	18	2.476438e-18	0.000000e+00
363	Synthesis of fructose-6-phosphate from erythro...	18	0.139241	3.522328	0.490451	0.028167	0.029531	0.139241	18	1.745384e-05	1.620833e-108
380	Conversion of 1-phosphatidyl-1D-myo-inositol 4...	19	0.346667	2.477783	0.858965	0.080830	0.083929	0.346667	19	1.178491e-10	2.657205e-96
420	Gluconeogenesis from Lactate	21	0.656250	3.228088	2.118433	0.033438	0.039633	0.216301	7	1.629216e-44	1.629216e-44
421	(R)-3-Hydroxybutanoate synthesis	21	0.890625	2.311013	2.058246	0.091209	0.099161	0.505747	15	1.938126e-50	3.015513e-201
430	Gluconeogenesis from pyruvate	21	0.078125	5.214004	0.407344	0.004710	0.005440	0.053292	7	2.143704e-05	8.758385e-14
431	UDP-glucose synthesis	21	0.046875	7.670739	0.359566	0.000000	0.000466	0.000000	22	9.390129e-07	9.390129e-07
440	Conversion of histidine to histamine	22	0.529412	2.409778	1.275765	0.088671	0.089835	0.529412	22	4.510992e-06	1.362200e-182
441	Synthesis of phosphatidylinositol from inositol	22	0.176471	5.335365	0.941535	0.004363	0.004818	0.055000	14	6.580741e-05	9.784487e-10
442	Conversion of glycine to pyruvate	22	0.058824	7.670739	0.451220	0.000312	0.000466	0.003976	2	7.906940e-03	7.906940e-03

Finally, we will get the unique list of filtered markers

[22]:

tf_idf_mrks = filtered_mrks['gene'].unique().tolist()
len(tf_idf_mrks)

[22]:

[23]:

tf_idf_mrks

[23]:

['Ethanol to acetaldehyde',
 'UDP-galactose synthesis',
 'Glycerol-3-phosphate synthesis',
 'S-adenosyl-L-methionine synthesis',
 'Homocysteine synthesis (need methionine)',
 'Degradation of cytosine',
 'Degradation of uracil',
 'Krebs cycle - NADH generation',
 'Gluconeogenesis from Glycerol',
 'Synthesis of fructose-6-phosphate from erythrose-4-phosphate (HMP shunt)',
 'Glycine synthesis',
 'Heme synthesis',
 'Methionine degradation',
 'Serine synthesis',
 'Conversion of histidine to histamine',
 'Gluconeogenesis from Lactate',
 'Synthesis of taurine from cysteine',
 'UDP-N-acetyl D-galactosamine synthesis',
 'Synthesis of bilirubin',
 'Conversion of asparate to asparagine',
 '(R)-3-Hydroxybutanoate synthesis',
 'Conversion of estrone (E1) to estradiol-17beta (E2)',
 'Synthesis of testosterone from androstenedione',
 'Conversion of 1-phosphatidyl-1D-myo-inositol 4,5-bisphosphate to 1D-myo-inositol 1,4,5-trisphosphate',
 'Gluconeogenesis from pyruvate',
 'UDP-glucose synthesis',
 'Synthesis of phosphatidylinositol from inositol',
 'Conversion of glycine to pyruvate']

Detection using Logistic Regression in Scanpy

We can also use logistic regression, as implemented in Scanpy, to identify metabolic task markers:

[24]:

method = 'logreg'
sc.tl.rank_genes_groups(results['adata'].metabolic_tasks, groupby=ANNOTATION, method=method,
                       use_raw=False, key_added=method)
sc.pl.rank_genes_groups(results['adata'].metabolic_tasks, n_genes=20, sharey=True, key=method, fontsize=4)

../_images/notebooks_mouse_embryo_spatial_52_0.png

Extract the result in a dataframe

[25]:

scanpy_df = sc.get.rank_genes_groups_df(results['adata'].metabolic_tasks,
                                        key=method,
                                        group=None)

scanpy_df.head()

[25]:

	names	scores
0	Mannose degradation (to fructose-6-phosphate)	1.480771
1	Degradation of cytosine	1.127286
2	Degradation of uracil	1.127286
3	Degradation of adenine to urate	0.926927
4	ATP generation from glucose (hypoxic condition...	0.905349

We can filter the logistic regression markers based on their scores:

[26]:

scanpy_df['scores'].hist()

[26]:

<Axes: >

../_images/notebooks_mouse_embryo_spatial_56_1.png

[27]:

# Filter based on score threshold
scores_threshold = 1.5
filtered_scanpy_mrks = scanpy_df.loc[scanpy_df['scores'] > scores_threshold]

[28]:

filtered_scanpy_mrks

[28]:

	group	names	scores
597	3	Conversion of asparate to asparagine	1.626031
796	4	Degradation of uracil	1.820949
797	4	Degradation of cytosine	1.820949
798	4	Presence of the thioredoxin system through the...	1.747238
995	5	Oxidative phosphorylation via succinate-coenzy...	1.639906
1194	6	Conversion of histidine to histamine	1.805179
1393	7	Conversion of histidine to histamine	1.569943
1791	9	Synthesis of taurine from cysteine	1.565231
1990	10	UDP-N-acetyl D-galactosamine synthesis	1.535176
2189	11	Synthesis of bilirubin	1.540485
2388	12	Synthesis of creatine from arginine	1.737556
2587	13	Synthesis of creatine from arginine	1.926244
2786	14	Fucosylation (addition of fucose)	1.906202
2787	14	Conversion of estrone (E1) to estrone sulfate ...	1.570353
2985	15	Synthesis of creatine from arginine	2.710084
2986	15	Conversion of estrone (E1) to estrone sulfate ...	1.896867
3383	17	Synthesis of fructose-6-phosphate from erythro...	1.597093
3582	18	Glutathionate synthesis	1.662967
3980	20	NAD synthesis from nicotinamide	1.608922
4179	21	Glycogen degradation	2.259699
4180	21	Oxidative phosphorylation via succinate-coenzy...	1.691183

[29]:

# Convert them to a list
scanpy_mrks = filtered_scanpy_mrks['names'].unique().tolist()
len(scanpy_mrks)

[29]:

Combining TF-IDF & Logistic Regression Markers

We can combine the metabolic tasks from different approaches:

[32]:

both_markers = sorted(set(tf_idf_mrks).union(set(scanpy_mrks)))
len(both_markers)

[32]:

Select only highly active markers

We can filter the markers based on their activity level by aggregating the metabolic activity to the cell-type level. Here, we consider only metabolic tasks with a prudent score at the cluster level (> 1)

[33]:

agg = sccellfie.expression.aggregation.agg_expression_cells(results['adata'].metabolic_tasks[:, both_markers],
                                                            ANNOTATION,
                                                            agg_func='trimean')
agg = agg.T.loc[(agg.T > 1.).any(axis=1)]

both_markers = agg.index.tolist()

Then we can visualize them in space:

[34]:

## visualise only 5 markers due to notebook size
fig, axes = sccellfie.plotting.plot_spatial(results['adata'].metabolic_tasks,
                                            keys=[ANNOTATION] + both_markers[0:5],
                                            img_key='hires',
                                            cmap='YlGnBu',
                                            size=SPOT_SIZE,
                                            use_raw=False,
                                            ncols=3,
                                            vmin=0
                                            )

../_images/notebooks_mouse_embryo_spatial_66_0.png

Spatial patterns

scCellFie supports multiple algorithms to identify patterns across space, including the analysis of hotspots, assortativity, and autocorrelation (Moran’s I).

To illustrate the identification of spatial patterns, in this tutorial we will use the autocorrelation approach.

Generate spatial knn network

Methods to identify spatial patterns often rely on spatial neighbors, so we start computing our KNN network based on spatial locations. We compute this network in our original AnnData object and the one containing the metabolic scores for each task, in this case based on the 10-nearest neighbors:

[35]:

sccellfie.spatial.create_knn_network(adata, n_neighbors=10, spatial_key='spatial')

[36]:

sccellfie.spatial.create_knn_network(results['adata'].metabolic_tasks, n_neighbors=10, spatial_key='spatial')

Computing autocorrelation (Moran’s I coefficient)

Moran’s I is a measure of spatial autocorrelation, which quantifies how similar a variable is to itself in nearby spatial locations. In the context of spatial transcriptomics, where each spot or cell has a spatial coordinate and, here, an inferred metabolic activity score, Moran’s I can identify whether these scores are spatially clustered (positive autocorrelation), dispersed (negative autocorrelation), or randomly distributed (no autocorrelation).

Let:

\((x_i)\) be the metabolic activity score at spatial location (\(i\)),
\((\bar{x})\) be the mean metabolic activity across all locations,
\(( w_{ij} )\) be the spatial weight between location (\(i\)) and (\(j\)) (e.g., inverse distance or adjacency),
\((N)\) be the total number of spatial locations.

Then Moran’s I is computed as:

\[I = \frac{N}{W} \cdot \frac{\sum_{i=1}^{N} \sum_{j=1}^{N} w_{ij} (x_i - \bar{x})(x_j - \bar{x})}{\sum_{i=1}^{N} (x_i - \bar{x})^2}\]

where:

\[W = \sum_{i=1}^{N} \sum_{j=1}^{N} w_{ij}\]

Interpretation:

( I > 0 ): Positive spatial autocorrelation (similar metabolic activities cluster spatially).
( I < 0 ): Negative spatial autocorrelation (dissimilar metabolic activities are near each other).
( I \(\approx\) 0 ): No spatial autocorrelation (random spatial distribution).

[37]:

sq.gr.spatial_autocorr(
    results['adata'].metabolic_tasks,
    mode="moran",
    n_perms=1000,
    n_jobs=1,
)

[38]:

moran = results['adata'].metabolic_tasks.uns['moranI']

[39]:

moran

[39]:

	I	pval_norm	var_norm	pval_z_sim	pval_sim	var_sim	pval_norm_fdr_bh	pval_z_sim_fdr_bh	pval_sim_fdr_bh
Heme synthesis	0.943502	0.0	0.000057	0.0	0.000999	0.000154	0.0	0.0	0.000999
Tyrosine to acetoacetate and fumarate	0.931207	0.0	0.000057	0.0	0.000999	0.000147	0.0	0.0	0.000999
Synthesis of bilirubin	0.918218	0.0	0.000057	0.0	0.000999	0.000150	0.0	0.0	0.000999
Synthesis of globoside (link with globoside metabolism)	0.912165	0.0	0.000057	0.0	0.000999	0.000147	0.0	0.0	0.000999
Synthesis of glucocerebroside	0.908566	0.0	0.000057	0.0	0.000999	0.000147	0.0	0.0	0.000999
...	...	...	...	...	...	...	...	...	...
Synthesis of anthranilate from tryptophan	0.243993	0.0	0.000057	0.0	0.000999	0.000063	0.0	0.0	0.000999
Synthesis of L-kynurenine from tryptophan	0.177245	0.0	0.000057	0.0	0.000999	0.000054	0.0	0.0	0.000999
GDP-L-fucose synthesis	0.165651	0.0	0.000057	0.0	0.000999	0.000044	0.0	0.0	0.000999
Phosphatidyl-inositol to glucosaminyl-acylphosphatidylinositol	0.155917	0.0	0.000057	0.0	0.000999	0.000046	0.0	0.0	0.000999
Conversion of glutamate to glutamine	0.136900	0.0	0.000057	0.0	0.000999	0.000046	0.0	0.0	0.000999

199 rows × 9 columns

Visualize spatially informative variables (High Moran’s I coefficient)

[40]:

moran['I'].hist()

[40]:

<Axes: >

../_images/notebooks_mouse_embryo_spatial_76_1.png

We can set a threshold to select highly autocorrelated featurres, in this case we use a Moran’s I of at least 0.6

[47]:

coef_threshold = 0.6

[48]:

top_moran = moran[moran['I'].abs() >= coef_threshold].dropna().sort_values(by='I', ascending=False)

[49]:

top_moran

[49]:

	I	pval_norm	var_norm	pval_z_sim	pval_sim	var_sim	pval_norm_fdr_bh	pval_z_sim_fdr_bh	pval_sim_fdr_bh
Heme synthesis	0.943502	0.0	0.000057	0.0	0.000999	0.000154	0.0	0.0	0.000999
Tyrosine to acetoacetate and fumarate	0.931207	0.0	0.000057	0.0	0.000999	0.000147	0.0	0.0	0.000999
Synthesis of bilirubin	0.918218	0.0	0.000057	0.0	0.000999	0.000150	0.0	0.0	0.000999
Synthesis of globoside (link with globoside metabolism)	0.912165	0.0	0.000057	0.0	0.000999	0.000147	0.0	0.0	0.000999
Synthesis of glucocerebroside	0.908566	0.0	0.000057	0.0	0.000999	0.000147	0.0	0.0	0.000999
...	...	...	...	...	...	...	...	...	...
Inosine monophosphate synthesis (IMP)	0.609340	0.0	0.000057	0.0	0.000999	0.000090	0.0	0.0	0.000999
Degradation of n2m2nmasn	0.608268	0.0	0.000057	0.0	0.000999	0.000101	0.0	0.0	0.000999
N-glycan processing (ER)	0.604972	0.0	0.000057	0.0	0.000999	0.000089	0.0	0.0	0.000999
Deoxyadenosine triphosphate synthesis (dATP)	0.604090	0.0	0.000057	0.0	0.000999	0.000090	0.0	0.0	0.000999
Deoxycytidine triphosphate synthesis (dCTP)	0.600372	0.0	0.000057	0.0	0.000999	0.000099	0.0	0.0	0.000999

145 rows × 9 columns

[50]:

spa_informative_var = list(top_moran.index)

Then, we can further filter top metabolic tasks to pick those with a high activity in at least one Visium spot

[51]:

mt_df = results['adata'].metabolic_tasks.to_df().max(axis=0)
spa_informative_var = list(mt_df[spa_informative_var][mt_df[spa_informative_var] > 5*np.log(2)].index)

[52]:

spa_informative_var

[52]:

['Heme synthesis',
 'Synthesis of bilirubin',
 'Homocysteine degradation',
 'Krebs cycle - NADH generation',
 'Degradation of uracil',
 'Degradation of cytosine',
 'Methionine degradation',
 'Cysteine synthesis (need serine and methionine)',
 'Gluconeogenesis from Glycerol',
 'Glycerol-3-phosphate synthesis',
 'Oxidative phosphorylation via succinate-coenzyme Q oxidoreductase (COMPLEX II)',
 'Gluconeogenesis from pyruvate',
 'Phenylalanine to tyrosine',
 'Gluconeogenesis from Lactate',
 'Synthesis of GABA',
 'Glutathionate synthesis',
 'Reactive oxygen species detoxification (H2O2 to H2O)',
 'Synthesis of creatine from arginine',
 'Synthesis of malonyl-coa',
 'Synthesis of acetone',
 'Acetoacetate synthesis',
 'ATP generation from glucose (hypoxic conditions) - glycolysis',
 'Hydroxymethylglutaryl-CoA synthesis',
 'Glycogen degradation',
 'Glycine synthesis',
 'Conversion of asparate to asparagine',
 'Homocysteine synthesis (need methionine)',
 'Synthesis of methylglyoxal',
 '(R)-3-Hydroxybutanoate synthesis',
 'Serine synthesis',
 'UDP-glucose synthesis',
 'Synthesis of fructose-6-phosphate from erythrose-4-phosphate (HMP shunt)',
 'Glucuronate synthesis (via udp-glucose)',
 'UDP-glucuronate synthesis',
 'S-adenosyl-L-methionine synthesis',
 "3'-Phospho-5'-adenylyl sulfate synthesis",
 'Conversion of histidine to histamine',
 'Mannose trimming (mannosidase)',
 'Synthesis of inositol',
 'UDP-N-acetyl D-galactosamine synthesis',
 'Conversion of glycine to pyruvate',
 'Presence of the thioredoxin system through the thioredoxin reductase activity',
 'Biosynthesis of m4mpdol_U',
 'Conversion of 1-phosphatidyl-1D-myo-inositol 4,5-bisphosphate to 1D-myo-inositol 1,4,5-trisphosphate',
 'Synthesis of ornithine from glutamine',
 'Conversion of estrone (E1) to estrone sulfate (E1S)',
 'Conversion of glutamate to proline',
 'CMP-N-acetylneuraminate synthesis',
 'Synthesis of taurine from cysteine']

Finally, we can visualize them in space to observe the patterns they have:

[53]:

spa_inf_titles = [t + " (Moran's I: {:.3f})".format(top_moran.loc[t, 'I']) for t in spa_informative_var ]

## visualise only 5 markers due to notebook size
fig, axes = sccellfie.plotting.plot_spatial(results['adata'].metabolic_tasks,
                                            keys=spa_informative_var [0:5],
                                            img_key='hires',
                                            cmap='YlGnBu',
                                            size=SPOT_SIZE,
                                            use_raw=False,
                                            ncols=3,
                                            vmin=0,
                                            hspace=0.2,
                                            title=spa_inf_titles
                                            )

../_images/notebooks_mouse_embryo_spatial_86_0.png

Cell-cell communication

For spatial data, scCellFie can predict cell-cell communication (CCC) by calculating the co-localization of the synthesis of a metabolite and the receptor expression within a pre-defined neighborhood. This is different to the approach we take in single-cell data as shown in this tutorial, where we infer CCC by first aggregating metabolic scores and gene expression into a cell type level, and then using them to compute a communication score.

Defining neighborhoods

We start defining our neighborhoods by finding an acceptable ratio where each spot is allocated with its immediate neighbors:

[54]:

r = sccellfie.spatial.compute_neighbor_distribution(
    adata,
    radius_range=(100, 1500),  # adjust based on your data's scale
    n_points=200,
    spatial_key='spatial'
)

[55]:

sccellfie.plotting.spatial.plot_neighbor_distribution(r)

[55]:

(<Figure size 1500x800 with 4 Axes>, GridSpec(2, 3))

../_images/notebooks_mouse_embryo_spatial_90_1.png

Here, we select a radius of 450 to keep mode of neighbors (in distribution) of ~6 (immediate neighbors).

[56]:

radius = 450

Data preparation

scCellFie generates separate AnnData objects containing either reaction scores (adata.reactions) or task scores (adata.metabolic_tasks). Therefore, to infer cell-cell communication, we need to put metabolic task scores together with gene expression in the same AnnData object.

Here, our adata already contains log1p(CP10k) values, as scCellFie processed it when running its pipeline.

Then, we transfer metabolic scores into the AnnData object containing gene expression. For this, we use a built-in function in scCellFie to transfer features from one source adata into another target adata. In this case we transfer all metabolic tasks (by using var_names=adata.metabolic_tasks.var_names).

[57]:

adata_updated = sccellfie.preprocessing.adata_utils.transfer_variables(
    adata_target=adata,
    adata_source=results['adata'].metabolic_tasks,
    var_names=results['adata'].metabolic_tasks.var_names
)

adata_updated

[57]:

AnnData object with n_obs × n_vars = 6434 × 19487
    obs: 'in_tissue', 'array_row', 'array_col', 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes', 'n_genes', 'size_factors'
    uns: 'dataset_description', 'dataset_id', 'dataset_name', 'dataset_organism', 'dataset_reference', 'dataset_summary', 'dataset_url', 'normalization_id', 'spatial', 'normalization', 'neighbors', 'umap', 'spatial_neighbors', 'spatial_network'
    obsm: 'spatial', 'X_pca', 'X_umap'
    layers: 'counts', 'normalized'
    obsp: 'distances', 'connectivities', 'spatial_connectivities', 'spatial_distances'

Compute cell-cell communication based on synthesis and receptor pairs

As a demonstration, we selected the 'Synthesis of GABA' to compute local co-localization scores for the pathway and its represented receptor Gabra1.

A 'pairwise_concordance' score is computed by first identifying neighboring spots within a predefined radius (already done a couple of steps above). Each sender-receiver spot pair within the neighborhood is then evaluated.

To measure co-localization, we account for pairs of spots where one partner presents a metabolic score above a threshold and the other has the receptor expression above another threshold. Here, we do not input any threshold value to the function, so scCellFie set them by default to be mean metabolic score and receptor expression across all spots. Finally, the co-localization score corresponds to the fraction between the pairs of spots that are above the respective thresholds and the total number of pairs within a given neighborhood.

[58]:

adata_updated.obs['GABA^Gabra1'] = sccellfie.communication.compute_local_colocalization_scores(
    adata_updated,
    var1='Synthesis of GABA',
    var2='Gabra1',
    neighbors_radius=radius,
    method='pairwise_concordance',
    min_neighbors=3,
    spatial_key='spatial',
    inplace=False
)

Then, we can visualize these synthesis of GABA, the expression of the receptor, the co-localization communication scores, and the main enzymes involved in this synthesis (Gad1 and Gad2)

[59]:

fig, axes = sccellfie.plotting.plot_spatial(adata_updated,
                                            keys=['GABA^Gabra1',
                                                  'Synthesis of GABA',
                                                  'Gabra1', # Receptor
                                                  'Gad1', # Synthase
                                                  'Gad2', # Synthase,
                                                  None # H&E
                                                  ],
                                            img_key="hires",
                                            cmap='YlGnBu',
                                            size=SPOT_SIZE,
                                            use_raw=False,
                                            ncols=3,
                                            vmin=0,
                                            hspace=0.15
                                            )

../_images/notebooks_mouse_embryo_spatial_98_0.png