Spatial Overview

Open in Colab

In this tutorial, we will use a processed 10X visium dataset of a Mouse Embryo (CytAssist FFPE). Coupled with H&E image, the visium section captures the transcriptome of a whole mouse embryo, enabling identification of regional metabolic processes across organs at early development.

This tutorial includes following steps:

Loading libraries

[1]:
import sccellfie

import scanpy as sc
import squidpy as sq

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

## To avoid warnings
import warnings
warnings.filterwarnings("ignore")
/home/jovyan/my-conda-envs/single_cell/lib/python3.10/site-packages/dask/dataframe/__init__.py:31: FutureWarning: The legacy Dask DataFrame implementation is deprecated and will be removed in a future version. Set the configuration option `dataframe.query-planning` to `True` or None to enable the new Dask Dataframe implementation and silence this warning.
  warnings.warn(

Define parameters

[2]:
SPOT_SIZE = 1.3
DPI = 150
ANNOTATION = 'leiden'

Loading data

The 10X Visium Cytassist dataset contains 6434 spots (n_obs) with 19288 genes (n_vars).

This is a processed dataset created by the OpenProblems community. Coupled with H&E image, the visium section captures the transcriptome of a whole mouse embryo, enabling identification of regional metabolic processes.

[3]:
adata = sc.read(filename='./data/MusMusculus_WholeEmbryo.h5ad',
                backup_url='https://zenodo.org/records/15330688/files/MusMusculus_WholeEmbryo.h5ad')
[4]:
sc.pl.spatial(adata,
              color=[None, 'total_counts'],
              size=SPOT_SIZE,
              cmap='YlGnBu',
              img_key="hires",
             )
../_images/notebooks_mouse_embryo_spatial_10_0.png

Preprocessing spatial data

Gene names

Although gene names in the form of gene symbols are preferred, scCellFie supports the ENSEMBL IDs as used in .var_names of this AnnData object. scCellFie can internally convert ENSEMBL IDs into gene symbols. In this case the gene symbols can be found under the column name of feature_name.

[5]:
adata.var.head()
[5]:
feature_id feature_types genome n_cells_by_counts mean_counts log1p_mean_counts pct_dropout_by_counts total_counts log1p_total_counts n_cells feature_name
gene_ids
ENSMUSG00000051951 ENSMUSG00000051951 Gene Expression mm10 742 0.147187 0.137313 88.467516 947.0 6.854354 742 Xkr4
ENSMUSG00000025900 ENSMUSG00000025900 Gene Expression mm10 440 0.073205 0.070649 93.161330 471.0 6.156979 440 Rp1
ENSMUSG00000025902 ENSMUSG00000025902 Gene Expression mm10 2614 0.852813 0.616705 59.372086 5487.0 8.610319 2614 Sox17
ENSMUSG00000025903 ENSMUSG00000025903 Gene Expression mm10 5205 4.515853 1.707626 19.101647 29055.0 10.276980 5205 Lypla1
ENSMUSG00000033813 ENSMUSG00000033813 Gene Expression mm10 5652 6.649984 2.034703 12.154181 42786.0 10.663990 5652 Tcea1

(Optional) To work with gene symbols, we will replace the index names of adata.var with the column feature_name.

[6]:
adata.var.set_index('feature_name', inplace=True)
adata.var.head()
[6]:
feature_id feature_types genome n_cells_by_counts mean_counts log1p_mean_counts pct_dropout_by_counts total_counts log1p_total_counts n_cells
feature_name
Xkr4 ENSMUSG00000051951 Gene Expression mm10 742 0.147187 0.137313 88.467516 947.0 6.854354 742
Rp1 ENSMUSG00000025900 Gene Expression mm10 440 0.073205 0.070649 93.161330 471.0 6.156979 440
Sox17 ENSMUSG00000025902 Gene Expression mm10 2614 0.852813 0.616705 59.372086 5487.0 8.610319 2614
Lypla1 ENSMUSG00000025903 Gene Expression mm10 5205 4.515853 1.707626 19.101647 29055.0 10.276980 5205
Tcea1 ENSMUSG00000033813 Gene Expression mm10 5652 6.649984 2.034703 12.154181 42786.0 10.663990 5652

Raw counts

Raw counts are the inputs that scCellFie uses to internally perform all data processing that is needed.

We start preparing .X to contain the raw count matrix:

[7]:
adata.X = adata.layers['counts'].copy()

Run scCellFie

In current version, scCellFie supports human or mouse specific metabolic pathways. We will select organism='mouse' for the analysis.

[8]:
results = sccellfie.run_sccellfie_pipeline(adata,
                                           organism='mouse',
                                           sccellfie_data_folder=None,
                                           n_counts_col='total_counts',
                                           process_by_group=False,
                                           groupby=None,
                                           neighbors_key='neighbors',
                                           n_neighbors=10,
                                           batch_key=None,
                                           threshold_key='sccellfie_threshold',
                                           smooth_cells=True,
                                           alpha=0.33,
                                           chunk_size=5000,
                                           disable_pbar=False,
                                           save_folder=None,
                                           save_filename=None
                                          )

==== scCellFie Pipeline: Initializing ====
Loading scCellFie database for organism: mouse

==== scCellFie Pipeline: Processing entire dataset ====

---- scCellFie Step: Preprocessing data ----

---- scCellFie Step: Computing neighbors ----
2025-05-10 12:43:36.638794: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
E0000 00:00:1746881016.659355   13629 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1746881016.665681   13629 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-05-10 12:43:36.697329: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.

---- scCellFie Step: Preparing inputs ----
Gene names corrected to match database: 9
Shape of new adata object: (6434, 756)
Number of GPRs: 676
Shape of tasks by genes: (200, 756)
Shape of reactions by genes: (676, 756)
Shape of tasks by reactions: (200, 676)

---- scCellFie Step: Smoothing gene expression ----
Smoothing Expression: 100%|██████████| 1/1 [00:00<00:00,  1.15it/s]

---- scCellFie Step: Computing gene scores ----

---- scCellFie Step: Computing reaction activity ----
Cell Rxn Activities: 100%|██████████| 6434/6434 [00:53<00:00, 121.26it/s]

---- scCellFie Step: Computing metabolic task activity ----
Removed 1 metabolic tasks with zeros across all cells.

==== scCellFie Pipeline: Processing completed successfully ====

Export results

[9]:
results.keys()
[9]:
dict_keys(['adata', 'gpr_rules', 'task_by_gene', 'rxn_by_gene', 'task_by_rxn', 'rxn_info', 'task_info', 'thresholds', 'organism'])

To access metabolic activities, we need to inspect results['adata']:

  • The processed spatial data is located in the AnnData object results['adata'].

  • The reaction activities for each spot are located in the AnnData object results['adata'].reactions.

  • The metabolic task activities for each spot are located in the AnnData object results['adata'].metabolic_tasks.

In particular:

  • results['adata']: contains gene expression in .X.

  • results['adata'].layers['gene_scores']: contains gene scores as in the original CellFie paper.

  • results['adata'].uns['Rxn-Max-Genes']: contains determinant genes for each reaction per spot.

  • results['adata'].reactions: contains reaction scores in .X so every scanpy function can be used on this object to visualize or compare values.

  • results['adata'].metabolic_tasks: contains metabolic task scores in .X so every scanpy function can be used on this object to visualize or compare values.

Other keys in the results dictionary are associated with the scCellFie database and are already filtered for the elements present in the dataset ('gpr_rules', 'task_by_gene', 'rxn_by_gene', 'task_by_rxn', 'rxn_info', 'task_info', 'thresholds', 'organism').

Save spatial results

We can save our spatial results contained in the AnnData objects (results['adata']) into a specific folder. Here we only need to specify the output folder, and the basename that our AnnData objects will have. This function below saves the expression object (results['adata']) and those containing scores for reactions and metabolic tasks (results['adata'].reactions and results['adata'].metabolic_tasks, respectively), as separate files.

[10]:
sccellfie.io.save_adata(adata=results['adata'], output_directory='./results', filename='Mouse_Spatial_scCellFie')
./results/Mouse_Spatial_scCellFie.h5ad was correctly saved
./results/Mouse_Spatial_scCellFie_reactions.h5ad was correctly saved
./results/Mouse_Spatial_scCellFie_metabolic_tasks.h5ad was correctly saved

Spatial visualization

scCellFie implements multiple visualizations, including plots to see metabolic activity across tissues.

In this case, we first define metabolic tasks that may be important for early stages of development:

[11]:
development = [
    'ATP generation from glucose (hypoxic conditions) - glycolysis',
    'Conversion of histidine to histamine',
    'O-linked glycosylation',
    'CMP-N-acetylneuraminate synthesis',
    'Conversion of estrone (E1) to estradiol-17beta (E2)',
    'Synthesis of inositol',
    'Conversion of 1-phosphatidyl-1D-myo-inositol 4,5-bisphosphate to 1D-myo-inositol 1,4,5-trisphosphate',
    'Synthesis of GABA',
    'Reactive oxygen species detoxification (H2O2 to H2O)',
    'S-adenosyl-L-methionine synthesis'
]

Then, we use a scCellFie plotting function to visualize them in space:

[12]:
fig, axes = sccellfie.plotting.plot_spatial(results['adata'].metabolic_tasks,
                                            keys=development ,
                                            img_key='hires',
                                            cmap='YlGnBu',
                                            size=SPOT_SIZE,
                                            use_raw=False,
                                            ncols=5,
                                            hspace=0.3,
                                            vmin=0
                                            )
../_images/notebooks_mouse_embryo_spatial_30_0.png

Identification of metabolic markers

To identify markers, we can use different approaches. Here we show an approach based on TF-IDF, which comes from the Natural Language Processing field, and the logistic regression implemented in Scanpy.

Defining annotated regions for the selection of markers

To select markers, we need to define region annotations so the algorithms can focus on enriched activities in those regions.

As a demonstration, we will perform simple unsupervised clustering to define regional features of the whole embryo.

[13]:
sadata = adata.copy()
sc.pp.normalize_total(sadata, target_sum=1e4)
sc.pp.log1p(sadata)
sc.pp.highly_variable_genes(sadata, min_mean=0.0125, max_mean=3, min_disp=0.5)
sadata = sadata[:, sadata.var.highly_variable]
sc.tl.pca(sadata, svd_solver='arpack')
sc.pp.neighbors(sadata, n_neighbors=10, n_pcs=40)
sc.tl.umap(sadata)
sc.tl.leiden(sadata, resolution=0.4)

Then, we assign these annotations to our AnnData objects in under results['adata'] to help with further analyses and visualizations

[14]:
results['adata'].obs[ANNOTATION] = sadata.obs['leiden']
results['adata'].reactions.obs[ANNOTATION] = sadata.obs['leiden']
results['adata'].metabolic_tasks.obs[ANNOTATION] = sadata.obs['leiden']
[15]:
fig, axes = sccellfie.plotting.plot_spatial(results['adata'],
                                            keys=[None, ANNOTATION],
                                            img_key='hires',
                                            cmap='YlGnBu',
                                            size=SPOT_SIZE ,
                                            use_raw=False,
                                            ncols=2,
                                            dpi=DPI
                                            )
../_images/notebooks_mouse_embryo_spatial_37_0.png

Marker detection using TF-IDF

scCellFie implements a TF-IDF (Term Frequency-Inverse Document Frequency) approach, adapted from the SoupX tool in R.

We define express_cut=5*np.log(2) to determine which spots are expressing each metabolic task. This value (5*np.log(2)) indicates that all reactions in the task are over the threshold value of the determinant gene (or “active”).

[16]:
mrks = sccellfie.external.quick_markers(results['adata'].metabolic_tasks,
                                        cluster_key=ANNOTATION,
                                        n_markers=20,
                                        express_cut=5*np.log(2))
[17]:
mrks.head()
[17]:
gene cluster tf idf tf_idf gene_frequency_outside_cluster gene_frequency_global second_best_tf second_best_cluster pval qval
0 Glycogen degradation 0 0.905195 1.592333 1.441372 0.108051 0.203450 0.942529 15 0.0 0.0
1 Hydroxymethylglutaryl-CoA synthesis 0 0.802597 1.750950 1.405308 0.088100 0.173609 0.919540 15 0.0 0.0
2 Acetoacetate synthesis 0 0.709091 1.884865 1.336541 0.076095 0.151850 0.850575 15 0.0 0.0
3 Synthesis of acetone 0 0.709091 1.884865 1.336541 0.076095 0.151850 0.850575 15 0.0 0.0
4 Synthesis of creatine from arginine 0 0.990909 1.241558 1.230271 0.193503 0.288934 0.994253 15 0.0 0.0

By examining the distributions of TF-IDF scores, we can filter the results to identify more specific tasks:

[18]:
mrks['tf_idf'].hist()
[18]:
<Axes: >
../_images/notebooks_mouse_embryo_spatial_42_1.png
[19]:
# Scatter plot of TF vs IDF with TF-IDF as color
scatter = plt.scatter(mrks['tf'], mrks['idf'], c=mrks['tf_idf'], cmap='YlGnBu')
plt.colorbar(scatter)
plt.xlabel('TF')
plt.ylabel('IDF')
[19]:
Text(0, 0.5, 'IDF')
../_images/notebooks_mouse_embryo_spatial_43_1.png

scCellFie includes a filtering function that fits a hyperbolic curve, while the user can define some thresholds based on the tf-idf scores, and the ratio between the top and second best clusters (tf_ratio).

[20]:
x_col = 'tf'
y_col = 'idf'
df = mrks
tfidf_threshold = 0.3
tf_ratio = 1.2

# Visualization
plt.figure(figsize=(6, 6))
# Plot all points
plt.scatter(df[x_col], df[y_col], alpha=0.5, c='lightgray', label='All points')
# Plot selected points
filtered_mrks, curve = sccellfie.external.filter_tfidf_markers(df, tf_col=x_col, idf_col=y_col,
                                                              tfidf_threshold=tfidf_threshold,
                                                              tf_ratio=tf_ratio)
plt.scatter(filtered_mrks[x_col], filtered_mrks[y_col], c='red', label='Selected points')
plt.plot(*curve, label='Fit curve')
plt.xlabel('TF')
plt.ylabel('IDF')
plt.legend()
[20]:
<matplotlib.legend.Legend at 0x7f5c746ef130>
../_images/notebooks_mouse_embryo_spatial_45_1.png

A detailed list of the selected markers:

[21]:
filtered_mrks
[21]:
gene cluster tf idf tf_idf gene_frequency_outside_cluster gene_frequency_global second_best_tf second_best_cluster pval qval
21 Ethanol to acetaldehyde 1 0.130841 4.184384 0.547490 0.004747 0.015232 0.058824 22 1.009938e-54 1.009938e-54
24 UDP-galactose synthesis 1 0.057944 5.214004 0.302120 0.000678 0.005440 0.057944 1 5.544113e-30 5.544113e-30
41 Glycerol-3-phosphate synthesis 2 0.107356 2.795542 0.300118 0.057157 0.061082 0.107356 2 2.263329e-05 0.000000e+00
62 S-adenosyl-L-methionine synthesis 3 0.301762 3.604566 1.087721 0.006355 0.027199 0.085324 11 5.039044e-130 5.039044e-130
63 Homocysteine synthesis (need methionine) 3 0.136564 4.478892 0.611655 0.001839 0.011346 0.037543 11 1.246696e-61 1.246696e-61
80 Degradation of cytosine 4 0.938497 2.583143 2.424271 0.012344 0.075536 0.430380 18 0.000000e+00 0.000000e+00
81 Degradation of uracil 4 0.938497 2.583143 2.424271 0.012344 0.075536 0.430380 18 0.000000e+00 0.000000e+00
100 Glycerol-3-phosphate synthesis 5 0.789731 2.795542 2.207726 0.011618 0.061082 0.107356 2 0.000000e+00 0.000000e+00
101 Krebs cycle - NADH generation 5 0.672372 3.099471 2.083996 0.002490 0.045073 0.031250 21 0.000000e+00 0.000000e+00
102 Gluconeogenesis from Glycerol 5 0.415648 3.559866 1.479651 0.002158 0.028443 0.025845 2 6.170726e-202 6.170726e-202
104 Synthesis of fructose-6-phosphate from erythro... 5 0.305623 3.522328 1.076506 0.010788 0.029531 0.139241 18 1.620833e-108 1.620833e-108
105 Glycine synthesis 5 0.259169 4.033153 1.045267 0.001328 0.017718 0.015905 2 3.962977e-122 3.962977e-122
106 Heme synthesis 5 0.232274 4.215475 0.979145 0.000000 0.014765 0.000000 22 2.854243e-119 2.854243e-119
109 Methionine degradation 5 0.146699 4.675007 0.685820 0.000000 0.009325 0.000000 22 2.170227e-74 2.170227e-74
110 Serine synthesis 5 0.146699 4.492686 0.659074 0.001992 0.011191 0.018809 7 1.714940e-61 1.714940e-61
120 Conversion of histidine to histamine 6 0.663043 2.409778 1.597787 0.055061 0.089835 0.529412 22 1.362200e-182 1.362200e-182
149 Gluconeogenesis from Lactate 7 0.216301 3.228088 0.698239 0.030417 0.039633 0.216301 7 7.201257e-34 1.629216e-44
160 Synthesis of taurine from cysteine 8 0.254839 2.430758 0.619451 0.079523 0.087970 0.254839 8 2.697996e-19 7.492150e-138
180 Synthesis of taurine from cysteine 9 0.631068 2.430758 1.533973 0.060571 0.087970 0.254839 8 7.492150e-138 7.492150e-138
200 UDP-N-acetyl D-galactosamine synthesis 10 0.078689 5.055780 0.397832 0.002774 0.006372 0.042071 9 4.986753e-22 4.986753e-22
220 Synthesis of bilirubin 11 0.986348 2.110058 2.081252 0.079954 0.121231 0.986348 11 4.338165e-281 0.000000e+00
221 Conversion of asparate to asparagine 11 0.907850 2.134718 1.938004 0.080606 0.118278 0.907850 11 7.658890e-231 0.000000e+00
222 S-adenosyl-L-methionine synthesis 11 0.085324 3.604566 0.307557 0.024426 0.027199 0.085324 11 2.733811e-07 5.039044e-130
306 (R)-3-Hydroxybutanoate synthesis 15 0.505747 2.311013 1.168788 0.087859 0.099161 0.505747 15 9.802464e-44 3.015513e-201
342 Conversion of estrone (E1) to estradiol-17beta... 17 0.057971 6.572127 0.380993 0.000159 0.001399 0.003226 8 3.233538e-13 3.233538e-13
343 Synthesis of testosterone from androstenedione 17 0.057971 6.466767 0.374885 0.000318 0.001554 0.057971 17 1.587698e-12 1.587698e-12
360 Degradation of cytosine 18 0.430380 2.583143 1.111732 0.071125 0.075536 0.430380 18 2.476438e-18 0.000000e+00
361 Degradation of uracil 18 0.430380 2.583143 1.111732 0.071125 0.075536 0.430380 18 2.476438e-18 0.000000e+00
363 Synthesis of fructose-6-phosphate from erythro... 18 0.139241 3.522328 0.490451 0.028167 0.029531 0.139241 18 1.745384e-05 1.620833e-108
380 Conversion of 1-phosphatidyl-1D-myo-inositol 4... 19 0.346667 2.477783 0.858965 0.080830 0.083929 0.346667 19 1.178491e-10 2.657205e-96
420 Gluconeogenesis from Lactate 21 0.656250 3.228088 2.118433 0.033438 0.039633 0.216301 7 1.629216e-44 1.629216e-44
421 (R)-3-Hydroxybutanoate synthesis 21 0.890625 2.311013 2.058246 0.091209 0.099161 0.505747 15 1.938126e-50 3.015513e-201
430 Gluconeogenesis from pyruvate 21 0.078125 5.214004 0.407344 0.004710 0.005440 0.053292 7 2.143704e-05 8.758385e-14
431 UDP-glucose synthesis 21 0.046875 7.670739 0.359566 0.000000 0.000466 0.000000 22 9.390129e-07 9.390129e-07
440 Conversion of histidine to histamine 22 0.529412 2.409778 1.275765 0.088671 0.089835 0.529412 22 4.510992e-06 1.362200e-182
441 Synthesis of phosphatidylinositol from inositol 22 0.176471 5.335365 0.941535 0.004363 0.004818 0.055000 14 6.580741e-05 9.784487e-10
442 Conversion of glycine to pyruvate 22 0.058824 7.670739 0.451220 0.000312 0.000466 0.003976 2 7.906940e-03 7.906940e-03

Finally, we will get the unique list of filtered markers

[22]:
tf_idf_mrks = filtered_mrks['gene'].unique().tolist()
len(tf_idf_mrks)
[22]:
28
[23]:
tf_idf_mrks
[23]:
['Ethanol to acetaldehyde',
 'UDP-galactose synthesis',
 'Glycerol-3-phosphate synthesis',
 'S-adenosyl-L-methionine synthesis',
 'Homocysteine synthesis (need methionine)',
 'Degradation of cytosine',
 'Degradation of uracil',
 'Krebs cycle - NADH generation',
 'Gluconeogenesis from Glycerol',
 'Synthesis of fructose-6-phosphate from erythrose-4-phosphate (HMP shunt)',
 'Glycine synthesis',
 'Heme synthesis',
 'Methionine degradation',
 'Serine synthesis',
 'Conversion of histidine to histamine',
 'Gluconeogenesis from Lactate',
 'Synthesis of taurine from cysteine',
 'UDP-N-acetyl D-galactosamine synthesis',
 'Synthesis of bilirubin',
 'Conversion of asparate to asparagine',
 '(R)-3-Hydroxybutanoate synthesis',
 'Conversion of estrone (E1) to estradiol-17beta (E2)',
 'Synthesis of testosterone from androstenedione',
 'Conversion of 1-phosphatidyl-1D-myo-inositol 4,5-bisphosphate to 1D-myo-inositol 1,4,5-trisphosphate',
 'Gluconeogenesis from pyruvate',
 'UDP-glucose synthesis',
 'Synthesis of phosphatidylinositol from inositol',
 'Conversion of glycine to pyruvate']

Detection using Logistic Regression in Scanpy

We can also use logistic regression, as implemented in Scanpy, to identify metabolic task markers:

[24]:
method = 'logreg'
sc.tl.rank_genes_groups(results['adata'].metabolic_tasks, groupby=ANNOTATION, method=method,
                       use_raw=False, key_added=method)
sc.pl.rank_genes_groups(results['adata'].metabolic_tasks, n_genes=20, sharey=True, key=method, fontsize=4)
../_images/notebooks_mouse_embryo_spatial_52_0.png

Extract the result in a dataframe

[25]:
scanpy_df = sc.get.rank_genes_groups_df(results['adata'].metabolic_tasks,
                                        key=method,
                                        group=None)

scanpy_df.head()
[25]:
group names scores
0 0 Mannose degradation (to fructose-6-phosphate) 1.480771
1 0 Degradation of cytosine 1.127286
2 0 Degradation of uracil 1.127286
3 0 Degradation of adenine to urate 0.926927
4 0 ATP generation from glucose (hypoxic condition... 0.905349

We can filter the logistic regression markers based on their scores:

[26]:
scanpy_df['scores'].hist()
[26]:
<Axes: >
../_images/notebooks_mouse_embryo_spatial_56_1.png
[27]:
# Filter based on score threshold
scores_threshold = 1.5
filtered_scanpy_mrks = scanpy_df.loc[scanpy_df['scores'] > scores_threshold]
[28]:
filtered_scanpy_mrks
[28]:
group names scores
597 3 Conversion of asparate to asparagine 1.626031
796 4 Degradation of uracil 1.820949
797 4 Degradation of cytosine 1.820949
798 4 Presence of the thioredoxin system through the... 1.747238
995 5 Oxidative phosphorylation via succinate-coenzy... 1.639906
1194 6 Conversion of histidine to histamine 1.805179
1393 7 Conversion of histidine to histamine 1.569943
1791 9 Synthesis of taurine from cysteine 1.565231
1990 10 UDP-N-acetyl D-galactosamine synthesis 1.535176
2189 11 Synthesis of bilirubin 1.540485
2388 12 Synthesis of creatine from arginine 1.737556
2587 13 Synthesis of creatine from arginine 1.926244
2786 14 Fucosylation (addition of fucose) 1.906202
2787 14 Conversion of estrone (E1) to estrone sulfate ... 1.570353
2985 15 Synthesis of creatine from arginine 2.710084
2986 15 Conversion of estrone (E1) to estrone sulfate ... 1.896867
3383 17 Synthesis of fructose-6-phosphate from erythro... 1.597093
3582 18 Glutathionate synthesis 1.662967
3980 20 NAD synthesis from nicotinamide 1.608922
4179 21 Glycogen degradation 2.259699
4180 21 Oxidative phosphorylation via succinate-coenzy... 1.691183
[29]:
# Convert them to a list
scanpy_mrks = filtered_scanpy_mrks['names'].unique().tolist()
len(scanpy_mrks)
[29]:
16

Combining TF-IDF & Logistic Regression Markers

We can combine the metabolic tasks from different approaches:

[32]:
both_markers = sorted(set(tf_idf_mrks).union(set(scanpy_mrks)))
len(both_markers)
[32]:
36

Select only highly active markers

We can filter the markers based on their activity level by aggregating the metabolic activity to the cell-type level. Here, we consider only metabolic tasks with a prudent score at the cluster level (> 1)

[33]:
agg = sccellfie.expression.aggregation.agg_expression_cells(results['adata'].metabolic_tasks[:, both_markers],
                                                            ANNOTATION,
                                                            agg_func='trimean')
agg = agg.T.loc[(agg.T > 1.).any(axis=1)]

both_markers = agg.index.tolist()

Then we can visualize them in space:

[34]:
## visualise only 5 markers due to notebook size
fig, axes = sccellfie.plotting.plot_spatial(results['adata'].metabolic_tasks,
                                            keys=[ANNOTATION] + both_markers[0:5],
                                            img_key='hires',
                                            cmap='YlGnBu',
                                            size=SPOT_SIZE,
                                            use_raw=False,
                                            ncols=3,
                                            vmin=0
                                            )
../_images/notebooks_mouse_embryo_spatial_66_0.png

Spatial patterns

scCellFie supports multiple algorithms to identify patterns across space, including the analysis of hotspots, assortativity, and autocorrelation (Moran’s I).

To illustrate the identification of spatial patterns, in this tutorial we will use the autocorrelation approach.

Generate spatial knn network

Methods to identify spatial patterns often rely on spatial neighbors, so we start computing our KNN network based on spatial locations. We compute this network in our original AnnData object and the one containing the metabolic scores for each task, in this case based on the 10-nearest neighbors:

[35]:
sccellfie.spatial.create_knn_network(adata, n_neighbors=10, spatial_key='spatial')
[36]:
sccellfie.spatial.create_knn_network(results['adata'].metabolic_tasks, n_neighbors=10, spatial_key='spatial')

Computing autocorrelation (Moran’s I coefficient)

Moran’s I is a measure of spatial autocorrelation, which quantifies how similar a variable is to itself in nearby spatial locations. In the context of spatial transcriptomics, where each spot or cell has a spatial coordinate and, here, an inferred metabolic activity score, Moran’s I can identify whether these scores are spatially clustered (positive autocorrelation), dispersed (negative autocorrelation), or randomly distributed (no autocorrelation).

Let:

  • \((x_i)\) be the metabolic activity score at spatial location (\(i\)),

  • \((\bar{x})\) be the mean metabolic activity across all locations,

  • \(( w_{ij} )\) be the spatial weight between location (\(i\)) and (\(j\)) (e.g., inverse distance or adjacency),

  • \((N)\) be the total number of spatial locations.

Then Moran’s I is computed as:

\[I = \frac{N}{W} \cdot \frac{\sum_{i=1}^{N} \sum_{j=1}^{N} w_{ij} (x_i - \bar{x})(x_j - \bar{x})}{\sum_{i=1}^{N} (x_i - \bar{x})^2}\]

where:

\[W = \sum_{i=1}^{N} \sum_{j=1}^{N} w_{ij}\]

Interpretation:

  • ( I > 0 ): Positive spatial autocorrelation (similar metabolic activities cluster spatially).

  • ( I < 0 ): Negative spatial autocorrelation (dissimilar metabolic activities are near each other).

  • ( I \(\approx\) 0 ): No spatial autocorrelation (random spatial distribution).

[37]:
sq.gr.spatial_autocorr(
    results['adata'].metabolic_tasks,
    mode="moran",
    n_perms=1000,
    n_jobs=1,
)
[38]:
moran = results['adata'].metabolic_tasks.uns['moranI']
[39]:
moran
[39]:
I pval_norm var_norm pval_z_sim pval_sim var_sim pval_norm_fdr_bh pval_z_sim_fdr_bh pval_sim_fdr_bh
Heme synthesis 0.943502 0.0 0.000057 0.0 0.000999 0.000154 0.0 0.0 0.000999
Tyrosine to acetoacetate and fumarate 0.931207 0.0 0.000057 0.0 0.000999 0.000147 0.0 0.0 0.000999
Synthesis of bilirubin 0.918218 0.0 0.000057 0.0 0.000999 0.000150 0.0 0.0 0.000999
Synthesis of globoside (link with globoside metabolism) 0.912165 0.0 0.000057 0.0 0.000999 0.000147 0.0 0.0 0.000999
Synthesis of glucocerebroside 0.908566 0.0 0.000057 0.0 0.000999 0.000147 0.0 0.0 0.000999
... ... ... ... ... ... ... ... ... ...
Synthesis of anthranilate from tryptophan 0.243993 0.0 0.000057 0.0 0.000999 0.000063 0.0 0.0 0.000999
Synthesis of L-kynurenine from tryptophan 0.177245 0.0 0.000057 0.0 0.000999 0.000054 0.0 0.0 0.000999
GDP-L-fucose synthesis 0.165651 0.0 0.000057 0.0 0.000999 0.000044 0.0 0.0 0.000999
Phosphatidyl-inositol to glucosaminyl-acylphosphatidylinositol 0.155917 0.0 0.000057 0.0 0.000999 0.000046 0.0 0.0 0.000999
Conversion of glutamate to glutamine 0.136900 0.0 0.000057 0.0 0.000999 0.000046 0.0 0.0 0.000999

199 rows × 9 columns

Visualize spatially informative variables (High Moran’s I coefficient)

[40]:
moran['I'].hist()
[40]:
<Axes: >
../_images/notebooks_mouse_embryo_spatial_76_1.png

We can set a threshold to select highly autocorrelated featurres, in this case we use a Moran’s I of at least 0.6

[47]:
coef_threshold = 0.6
[48]:
top_moran = moran[moran['I'].abs() >= coef_threshold].dropna().sort_values(by='I', ascending=False)
[49]:
top_moran
[49]:
I pval_norm var_norm pval_z_sim pval_sim var_sim pval_norm_fdr_bh pval_z_sim_fdr_bh pval_sim_fdr_bh
Heme synthesis 0.943502 0.0 0.000057 0.0 0.000999 0.000154 0.0 0.0 0.000999
Tyrosine to acetoacetate and fumarate 0.931207 0.0 0.000057 0.0 0.000999 0.000147 0.0 0.0 0.000999
Synthesis of bilirubin 0.918218 0.0 0.000057 0.0 0.000999 0.000150 0.0 0.0 0.000999
Synthesis of globoside (link with globoside metabolism) 0.912165 0.0 0.000057 0.0 0.000999 0.000147 0.0 0.0 0.000999
Synthesis of glucocerebroside 0.908566 0.0 0.000057 0.0 0.000999 0.000147 0.0 0.0 0.000999
... ... ... ... ... ... ... ... ... ...
Inosine monophosphate synthesis (IMP) 0.609340 0.0 0.000057 0.0 0.000999 0.000090 0.0 0.0 0.000999
Degradation of n2m2nmasn 0.608268 0.0 0.000057 0.0 0.000999 0.000101 0.0 0.0 0.000999
N-glycan processing (ER) 0.604972 0.0 0.000057 0.0 0.000999 0.000089 0.0 0.0 0.000999
Deoxyadenosine triphosphate synthesis (dATP) 0.604090 0.0 0.000057 0.0 0.000999 0.000090 0.0 0.0 0.000999
Deoxycytidine triphosphate synthesis (dCTP) 0.600372 0.0 0.000057 0.0 0.000999 0.000099 0.0 0.0 0.000999

145 rows × 9 columns

[50]:
spa_informative_var = list(top_moran.index)

Then, we can further filter top metabolic tasks to pick those with a high activity in at least one Visium spot

[51]:
mt_df = results['adata'].metabolic_tasks.to_df().max(axis=0)
spa_informative_var = list(mt_df[spa_informative_var][mt_df[spa_informative_var] > 5*np.log(2)].index)
[52]:
spa_informative_var
[52]:
['Heme synthesis',
 'Synthesis of bilirubin',
 'Homocysteine degradation',
 'Krebs cycle - NADH generation',
 'Degradation of uracil',
 'Degradation of cytosine',
 'Methionine degradation',
 'Cysteine synthesis (need serine and methionine)',
 'Gluconeogenesis from Glycerol',
 'Glycerol-3-phosphate synthesis',
 'Oxidative phosphorylation via succinate-coenzyme Q oxidoreductase (COMPLEX II)',
 'Gluconeogenesis from pyruvate',
 'Phenylalanine to tyrosine',
 'Gluconeogenesis from Lactate',
 'Synthesis of GABA',
 'Glutathionate synthesis',
 'Reactive oxygen species detoxification (H2O2 to H2O)',
 'Synthesis of creatine from arginine',
 'Synthesis of malonyl-coa',
 'Synthesis of acetone',
 'Acetoacetate synthesis',
 'ATP generation from glucose (hypoxic conditions) - glycolysis',
 'Hydroxymethylglutaryl-CoA synthesis',
 'Glycogen degradation',
 'Glycine synthesis',
 'Conversion of asparate to asparagine',
 'Homocysteine synthesis (need methionine)',
 'Synthesis of methylglyoxal',
 '(R)-3-Hydroxybutanoate synthesis',
 'Serine synthesis',
 'UDP-glucose synthesis',
 'Synthesis of fructose-6-phosphate from erythrose-4-phosphate (HMP shunt)',
 'Glucuronate synthesis (via udp-glucose)',
 'UDP-glucuronate synthesis',
 'S-adenosyl-L-methionine synthesis',
 "3'-Phospho-5'-adenylyl sulfate synthesis",
 'Conversion of histidine to histamine',
 'Mannose trimming (mannosidase)',
 'Synthesis of inositol',
 'UDP-N-acetyl D-galactosamine synthesis',
 'Conversion of glycine to pyruvate',
 'Presence of the thioredoxin system through the thioredoxin reductase activity',
 'Biosynthesis of m4mpdol_U',
 'Conversion of 1-phosphatidyl-1D-myo-inositol 4,5-bisphosphate to 1D-myo-inositol 1,4,5-trisphosphate',
 'Synthesis of ornithine from glutamine',
 'Conversion of estrone (E1) to estrone sulfate (E1S)',
 'Conversion of glutamate to proline',
 'CMP-N-acetylneuraminate synthesis',
 'Synthesis of taurine from cysteine']

Finally, we can visualize them in space to observe the patterns they have:

[53]:
spa_inf_titles = [t + " (Moran's I: {:.3f})".format(top_moran.loc[t, 'I']) for t in spa_informative_var ]

## visualise only 5 markers due to notebook size
fig, axes = sccellfie.plotting.plot_spatial(results['adata'].metabolic_tasks,
                                            keys=spa_informative_var [0:5],
                                            img_key='hires',
                                            cmap='YlGnBu',
                                            size=SPOT_SIZE,
                                            use_raw=False,
                                            ncols=3,
                                            vmin=0,
                                            hspace=0.2,
                                            title=spa_inf_titles
                                            )
../_images/notebooks_mouse_embryo_spatial_86_0.png

Cell-cell communication

For spatial data, scCellFie can predict cell-cell communication (CCC) by calculating the co-localization of the synthesis of a metabolite and the receptor expression within a pre-defined neighborhood. This is different to the approach we take in single-cell data as shown in this tutorial, where we infer CCC by first aggregating metabolic scores and gene expression into a cell type level, and then using them to compute a communication score.

Defining neighborhoods

We start defining our neighborhoods by finding an acceptable ratio where each spot is allocated with its immediate neighbors:

[54]:
r = sccellfie.spatial.compute_neighbor_distribution(
    adata,
    radius_range=(100, 1500),  # adjust based on your data's scale
    n_points=200,
    spatial_key='spatial'
)
[55]:
sccellfie.plotting.spatial.plot_neighbor_distribution(r)
[55]:
(<Figure size 1500x800 with 4 Axes>, GridSpec(2, 3))
../_images/notebooks_mouse_embryo_spatial_90_1.png

Here, we select a radius of 450 to keep mode of neighbors (in distribution) of ~6 (immediate neighbors).

[56]:
radius = 450

Data preparation

scCellFie generates separate AnnData objects containing either reaction scores (adata.reactions) or task scores (adata.metabolic_tasks). Therefore, to infer cell-cell communication, we need to put metabolic task scores together with gene expression in the same AnnData object.

Here, our adata already contains log1p(CP10k) values, as scCellFie processed it when running its pipeline.

Then, we transfer metabolic scores into the AnnData object containing gene expression. For this, we use a built-in function in scCellFie to transfer features from one source adata into another target adata. In this case we transfer all metabolic tasks (by using var_names=adata.metabolic_tasks.var_names).

[57]:
adata_updated = sccellfie.preprocessing.adata_utils.transfer_variables(
    adata_target=adata,
    adata_source=results['adata'].metabolic_tasks,
    var_names=results['adata'].metabolic_tasks.var_names
)

adata_updated
[57]:
AnnData object with n_obs × n_vars = 6434 × 19487
    obs: 'in_tissue', 'array_row', 'array_col', 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes', 'n_genes', 'size_factors'
    uns: 'dataset_description', 'dataset_id', 'dataset_name', 'dataset_organism', 'dataset_reference', 'dataset_summary', 'dataset_url', 'normalization_id', 'spatial', 'normalization', 'neighbors', 'umap', 'spatial_neighbors', 'spatial_network'
    obsm: 'spatial', 'X_pca', 'X_umap'
    layers: 'counts', 'normalized'
    obsp: 'distances', 'connectivities', 'spatial_connectivities', 'spatial_distances'

Compute cell-cell communication based on synthesis and receptor pairs

As a demonstration, we selected the 'Synthesis of GABA' to compute local co-localization scores for the pathway and its represented receptor Gabra1.

A 'pairwise_concordance' score is computed by first identifying neighboring spots within a predefined radius (already done a couple of steps above). Each sender-receiver spot pair within the neighborhood is then evaluated.

To measure co-localization, we account for pairs of spots where one partner presents a metabolic score above a threshold and the other has the receptor expression above another threshold. Here, we do not input any threshold value to the function, so scCellFie set them by default to be mean metabolic score and receptor expression across all spots. Finally, the co-localization score corresponds to the fraction between the pairs of spots that are above the respective thresholds and the total number of pairs within a given neighborhood.

[58]:
adata_updated.obs['GABA^Gabra1'] = sccellfie.communication.compute_local_colocalization_scores(
    adata_updated,
    var1='Synthesis of GABA',
    var2='Gabra1',
    neighbors_radius=radius,
    method='pairwise_concordance',
    min_neighbors=3,
    spatial_key='spatial',
    inplace=False
)

Then, we can visualize these synthesis of GABA, the expression of the receptor, the co-localization communication scores, and the main enzymes involved in this synthesis (Gad1 and Gad2)

[59]:
fig, axes = sccellfie.plotting.plot_spatial(adata_updated,
                                            keys=['GABA^Gabra1',
                                                  'Synthesis of GABA',
                                                  'Gabra1', # Receptor
                                                  'Gad1', # Synthase
                                                  'Gad2', # Synthase,
                                                  None # H&E
                                                  ],
                                            img_key="hires",
                                            cmap='YlGnBu',
                                            size=SPOT_SIZE,
                                            use_raw=False,
                                            ncols=3,
                                            vmin=0,
                                            hspace=0.15
                                            )
../_images/notebooks_mouse_embryo_spatial_98_0.png