{ "cells": [ { "cell_type": "markdown", "id": "suburban-dressing", "metadata": {}, "source": [ "# General Overview" ] }, { "cell_type": "markdown", "id": "monthly-station", "metadata": {}, "source": [ "[](https://colab.research.google.com/github/earmingol/scCellFie/blob/main/docs/source/notebooks/extended_quick_start.ipynb)" ] }, { "cell_type": "markdown", "id": "italian-community", "metadata": {}, "source": [ "This notebook provides a general overview of scCellFie analyses, including the inference of metabolic tasks, metabolic marker identification, differential analysis, and cell-cell communication.\n", "\n", "To illustrate a different application, the dataset we are using here includes [BALF samples from donors with varying severities of COVID-19](https://zenodo.org/record/7535867/files/BALF-COVID19-Liao_et_al-NatMed-2020.h5ad) [(Liao et al 2020)](https://doi.org/10.1038/s41591-020-0901-9)." ] }, { "cell_type": "markdown", "id": "weird-annotation", "metadata": {}, "source": [ "## This tutorial includes following steps:\n", "* [Loading libraries](#loading-libraries)\n", "* [Loading COVID-19 data](#loading-covid-19-data)\n", "* [Run scCellFie](#run-sccellfie)\n", "* [Export results](#export-results)\n", "* [Identification of metabolic task markers and visualization](#identification-of-metabolic-task-markers-and-visualization)\n", "* [Differential metabolic analysis](#differential-metabolic-analysis)\n", "* [Cell-cell communication](#cell-cell-communication)" ] }, { "cell_type": "markdown", "id": "french-shanghai", "metadata": {}, "source": [ "## Loading libraries " ] }, { "cell_type": "code", "execution_count": null, "id": "indonesian-reviewer", "metadata": {}, "outputs": [], "source": [ "import sccellfie\n", "import scanpy as sc\n", "import pandas as pd\n", "import numpy as np\n", "\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "import glasbey\n", "\n", "import textwrap\n", "\n", "## To avoid warnings\n", "import warnings\n", "warnings.filterwarnings(\"ignore\")" ] }, { "cell_type": "markdown", "id": "lonely-cheat", "metadata": {}, "source": [ "## Loading COVID-19 data " ] }, { "cell_type": "markdown", "id": "offshore-glance", "metadata": {}, "source": [ "We start loading our single-cell data. In this case, `adata.X` already contains raw counts, which are the main inputs of scCellFie." ] }, { "cell_type": "code", "execution_count": 3, "id": "killing-nepal", "metadata": {}, "outputs": [], "source": [ "adata = sc.read(filename='BALF-COVID19.h5ad', \n", " backup_url='https://zenodo.org/record/7535867/files/BALF-COVID19-Liao_et_al-NatMed-2020.h5ad')" ] }, { "cell_type": "code", "execution_count": 4, "id": "going-timber", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "AnnData object with n_obs × n_vars = 63103 × 33538\n", " obs: 'sample', 'sample_new', 'group', 'disease', 'hasnCoV', 'cluster', 'celltype', 'condition'" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "adata" ] }, { "cell_type": "markdown", "id": "sixth-terrorist", "metadata": {}, "source": [ "## Run scCellFie " ] }, { "cell_type": "markdown", "id": "ambient-imperial", "metadata": {}, "source": [ "Now we run scCellFie on the raw data." ] }, { "cell_type": "code", "execution_count": 5, "id": "prerequisite-palestinian", "metadata": {}, "outputs": [], "source": [ "batch_key = 'sample' # Specify batch_key or leave as None" ] }, { "cell_type": "code", "execution_count": 6, "id": "armed-shoot", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "==== scCellFie Pipeline: Initializing ====\n", "Loading scCellFie database for organism: human\n", "\n", "==== scCellFie Pipeline: Processing entire dataset ====\n", "\n", "---- scCellFie Step: Preprocessing data ----\n", "\n", "---- scCellFie Step: Computing neighbors ----\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "2025-04-14 13:02:45.789400: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered\n", "WARNING: All log messages before absl::InitializeLog() is called are written to STDERR\n", "E0000 00:00:1744635765.810778 14108 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered\n", "E0000 00:00:1744635765.817489 14108 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered\n", "2025-04-14 13:02:45.844525: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.\n", "To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\n", "---- scCellFie Step: Preparing inputs ----\n", "Gene names corrected to match database: 22\n", "Shape of new adata object: (63103, 929)\n", "Number of GPRs: 787\n", "Shape of tasks by genes: (218, 929)\n", "Shape of reactions by genes: (787, 929)\n", "Shape of tasks by reactions: (218, 787)\n", "\n", "---- scCellFie Step: Smoothing gene expression ----\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "Smoothing Expression: 100%|██████████| 13/13 [01:09<00:00, 5.36s/it]\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\n", "---- scCellFie Step: Computing gene scores ----\n", "\n", "---- scCellFie Step: Computing reaction activity ----\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "Cell Rxn Activities: 100%|██████████| 63103/63103 [05:49<00:00, 180.48it/s]\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\n", "---- scCellFie Step: Computing metabolic task activity ----\n", "Removed 1 metabolic tasks with zeros across all cells.\n", "\n", "==== scCellFie Pipeline: Processing completed successfully ====\n" ] } ], "source": [ "results = sccellfie.run_sccellfie_pipeline(adata, # Raw counts\n", " organism='human',\n", " sccellfie_data_folder=None,\n", " n_counts_col='n_counts', # Column where total counts per cells are stored in adata.obs\n", " process_by_group=False, # Whether to do the processing by cell groups\n", " groupby=None, # Column indicating cell groups if `process_by_group=True`\n", " neighbors_key='neighbors', # Neighbors information if precomputed. Otherwise, it will be computed here\n", " n_neighbors=10, # Number of neighbors to use\n", " batch_key=batch_key, # None if batches are not included\n", " threshold_key='sccellfie_threshold', # This is for using the default database. If personalized thresholds are used, specificy column name\n", " smooth_cells=True, # Whether to perform gene expression smoothing before running the tool\n", " alpha=0.33, # Importance of neighbors' expression for the smoothing (0 to 1)\n", " chunk_size=5000, # Number of chunks to run the processing steps (helps with the memory)\n", " disable_pbar=False, \n", " save_folder=None, # In case results will be saved. If so, results will not be returned and should be loaded from the folder (see sccellfie.io.load_data function\n", " save_filename=None # Name for saving the files, otherwise a default name will be used\n", " )" ] }, { "cell_type": "markdown", "id": "hawaiian-suffering", "metadata": {}, "source": [ "## Export results " ] }, { "cell_type": "code", "execution_count": 7, "id": "beautiful-waterproof", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "dict_keys(['adata', 'gpr_rules', 'task_by_gene', 'rxn_by_gene', 'task_by_rxn', 'rxn_info', 'task_info', 'thresholds', 'organism'])" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "results.keys()" ] }, { "cell_type": "markdown", "id": "contained-skirt", "metadata": {}, "source": [ "To access metabolic activities, we need to inspect ``results['adata']``:\n", "\n", "- The processed single-cell data is located in the AnnData object ``results['adata']``.\n", "- The reaction activities for each cell are located in the AnnData object ``results['adata'].reactions``.\n", "- The metabolic task activities for each cell are located in the AnnData object ``results['adata'].metabolic_tasks``.\n", "\n", "In particular:\n", "\n", "- ``results['adata']``: contains gene expression in ``.X``.\n", "- ``results['adata'].layers['gene_scores']``: contains gene scores as in the original CellFie paper.\n", "- ``results['adata'].uns['Rxn-Max-Genes']``: contains determinant genes for each reaction per cell.\n", "- ``results['adata'].reactions``: contains reaction scores in ``.X`` so every scanpy function can be used on this object to visualize or compare values.\n", "- ``results['adata'].metabolic_tasks``: contains metabolic task scores in ``.X`` so every scanpy function can be used on this object to visualize or compare values.\n", "\n", "Other keys in the ``results`` dictionary are associated with the scCellFie database and are already filtered for the elements present\n", "in the dataset (``'gpr_rules'``, ``'task_by_gene'``, ``'rxn_by_gene'``, ``'task_by_rxn'``, ``'rxn_info'``, ``'task_info'``, ``'thresholds'``, ``'organism'``)." ] }, { "cell_type": "markdown", "id": "reserved-longitude", "metadata": {}, "source": [ "### Save single-cell results\n", "\n", "We can save our single-cell results contained in the AnnData objects (``results['adata']``) into a specific folder. Here we only need to specify the output folder, and the basename that our AnnData objects will have. This function below saves the expression object (``results['adata']``) and those containing scores for reactions and metabolic tasks (``results['adata'].reactions`` and ``results['adata'].metabolic_tasks``, respectively), as separate files." ] }, { "cell_type": "code", "execution_count": null, "id": "african-painting", "metadata": {}, "outputs": [], "source": [ "sccellfie.io.save_adata(adata=results['adata'], output_directory='./results/', filename='Human_COVID19_scCellFie')" ] }, { "cell_type": "markdown", "id": "funded-country", "metadata": {}, "source": [ "## Identification of metabolic task markers and visualization \n", "\n", "To identify markers, we can use different approaches. Here we show an approach based on [TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf), which comes from the Natural Language Processing field, and the [logistic regression implemented in Scanpy](https://www.nature.com/articles/s41592-018-0303-9)." ] }, { "cell_type": "code", "execution_count": 8, "id": "classified-alabama", "metadata": {}, "outputs": [], "source": [ "# Our column indicating the cell types in the adata.obs dataframe\n", "cell_group = 'celltype'" ] }, { "cell_type": "code", "execution_count": 9, "id": "overhead-personal", "metadata": {}, "outputs": [], "source": [ "# We use glasbey to expand the palette into a larger number of colors\n", "# This is useful when we have many cell types\n", "palette = glasbey.extend_palette('Set2', \n", " palette_size=max([10, results['adata'].metabolic_tasks.obs[cell_group].unique().shape[0]]))" ] }, { "cell_type": "markdown", "id": "korean-timothy", "metadata": {}, "source": [ "### Detection using TF-IDF" ] }, { "cell_type": "markdown", "id": "artificial-anime", "metadata": {}, "source": [ "scCellFie runs a TF-IDF approach, as implemented in the tool SoupX in R.\n", "We define a `express_cut=5*np.log(2)` to define cells expressing each metabolic task. Here, a value of `5*np.log(2)` indicates that all reactions in the task are over the threshold value of the determinant gene (or \"active\"). This condition is often rare in single-cell data due to data sparsity (rarely all reactions in a task are over their determinant gene's threshold), but it is a good approach for identifying markers." ] }, { "cell_type": "code", "execution_count": 10, "id": "cordless-memorabilia", "metadata": {}, "outputs": [], "source": [ "mrks = sccellfie.external.quick_markers(results['adata'].metabolic_tasks,\n", " cluster_key=cell_group, \n", " n_markers=20, \n", " express_cut=5*np.log(2))" ] }, { "cell_type": "code", "execution_count": 11, "id": "universal-disposal", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
| \n", " | gene | \n", "cluster | \n", "tf | \n", "idf | \n", "tf_idf | \n", "gene_frequency_outside_cluster | \n", "gene_frequency_global | \n", "second_best_tf | \n", "second_best_cluster | \n", "pval | \n", "qval | \n", "
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | \n", "AMP salvage from adenine | \n", "B | \n", "0.212121 | \n", "2.324907 | \n", "0.493162 | \n", "0.097433 | \n", "0.097792 | \n", "0.242395 | \n", "NK | \n", "0.000001 | \n", "0.000000e+00 | \n", "
| 1 | \n", "Fructose degradation (to glucose-3-phosphate) | \n", "B | \n", "0.474747 | \n", "0.915665 | \n", "0.434710 | \n", "0.400016 | \n", "0.400250 | \n", "0.542776 | \n", "NK | \n", "0.019813 | \n", "2.608937e-66 | \n", "
| 2 | \n", "ATP generation from glucose (hypoxic condition... | \n", "B | \n", "0.282828 | \n", "1.489275 | \n", "0.421209 | \n", "0.225356 | \n", "0.225536 | \n", "0.458583 | \n", "T | \n", "0.034830 | \n", "0.000000e+00 | \n", "
| 3 | \n", "Glucose to lactate conversion | \n", "B | \n", "0.186869 | \n", "1.984669 | \n", "0.370872 | \n", "0.137270 | \n", "0.137426 | \n", "0.361966 | \n", "T | \n", "0.031173 | \n", "0.000000e+00 | \n", "
| 4 | \n", "Conversion of 1-phosphatidyl-1D-myo-inositol 4... | \n", "B | \n", "0.116162 | \n", "3.081438 | \n", "0.357945 | \n", "0.045672 | \n", "0.045893 | \n", "0.135940 | \n", "Epithelial | \n", "0.000045 | \n", "2.539957e-168 | \n", "
| \n", " | sender_celltype | \n", "receiver_celltype | \n", "ligand | \n", "receptor | \n", "score | \n", "ligand_fraction | \n", "receptor_fraction | \n", "
|---|---|---|---|---|---|---|---|
| 0 | \n", "B | \n", "B | \n", "Synthesis of estradiol-17beta (E2) from andros... | \n", "ESR1 | \n", "0.000000 | \n", "0.050505 | \n", "0.000000 | \n", "
| 1 | \n", "B | \n", "B | \n", "Synthesis of L-kynurenine from tryptophan | \n", "AHR | \n", "0.000000 | \n", "0.095960 | \n", "0.090909 | \n", "
| 2 | \n", "B | \n", "Epithelial | \n", "Synthesis of estradiol-17beta (E2) from andros... | \n", "ESR1 | \n", "0.000000 | \n", "0.050505 | \n", "0.011173 | \n", "
| 3 | \n", "B | \n", "Epithelial | \n", "Synthesis of L-kynurenine from tryptophan | \n", "AHR | \n", "0.124446 | \n", "0.095960 | \n", "0.414804 | \n", "
| 4 | \n", "B | \n", "Macrophages | \n", "Synthesis of estradiol-17beta (E2) from andros... | \n", "ESR1 | \n", "0.000000 | \n", "0.050505 | \n", "0.023146 | \n", "
| ... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "... | \n", "
| 195 | \n", "pDC | \n", "T | \n", "Synthesis of L-kynurenine from tryptophan | \n", "AHR | \n", "0.000000 | \n", "0.141892 | \n", "0.147710 | \n", "
| 196 | \n", "pDC | \n", "mDC | \n", "Synthesis of estradiol-17beta (E2) from andros... | \n", "ESR1 | \n", "0.000000 | \n", "0.108108 | \n", "0.007439 | \n", "
| 197 | \n", "pDC | \n", "mDC | \n", "Synthesis of L-kynurenine from tryptophan | \n", "AHR | \n", "0.095979 | \n", "0.141892 | \n", "0.318810 | \n", "
| 198 | \n", "pDC | \n", "pDC | \n", "Synthesis of estradiol-17beta (E2) from andros... | \n", "ESR1 | \n", "0.000000 | \n", "0.108108 | \n", "0.006757 | \n", "
| 199 | \n", "pDC | \n", "pDC | \n", "Synthesis of L-kynurenine from tryptophan | \n", "AHR | \n", "0.000000 | \n", "0.141892 | \n", "0.094595 | \n", "
200 rows × 7 columns
\n", "