Analyze Xenium data

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

import scanpy as sc
import squidpy as sq

Download the Feature-cell Matrix (HDF5) and the Cell summary file (CSV) from the Xenium breast cancer tumor microenvironment Dataset.

You need these 2 files in a new folder tutorial_data in the same path as your notebook.

  • Xenium_FFPE_Human_Breast_Cancer_Rep1_cell_feature_matrix.h5

  • Xenium_FFPE_Human_Breast_Cancer_Rep1_cells.csv

The following lines create the folder structure which can be used to load the data.

# !mkdir tutorial_data
# !mkdir tutorial_data/xenium_data
# !wget -P tutorial_data/xenium_data/ https://cf.10xgenomics.com/samples/xenium/preview/Xenium_FFPE_Human_Breast_Cancer_Rep1/Xenium_FFPE_Human_Breast_Cancer_Rep1_cell_feature_matrix.h5
# !wget -P tutorial_data/xenium_data/ https://cf.10xgenomics.com/samples/xenium/preview/Xenium_FFPE_Human_Breast_Cancer_Rep1/Xenium_FFPE_Human_Breast_Cancer_Rep1_cells.csv.gz
# !tar -xzf tutorial_data/xenium_data/Xenium_FFPE_Human_Breast_Cancer_Rep1_cells.csv.gz -C tutorial_data/xenium_data/
adata = sc.read_10x_h5(
    filename="tutorial_data/xenium_data/Xenium_FFPE_Human_Breast_Cancer_Rep1_cell_feature_matrix.h5"
)
df = pd.read_csv(
    "tutorial_data/xenium_data/Xenium_FFPE_Human_Breast_Cancer_Rep1_cells.csv"
)
df.set_index(adata.obs_names, inplace=True)
adata.obs = df.copy()
adata.obsm["spatial"] = adata.obs[["x_centroid", "y_centroid"]].copy().to_numpy()
adata.obs
cell_id x_centroid y_centroid transcript_counts control_probe_counts control_codeword_counts total_counts cell_area nucleus_area n_genes_by_counts log1p_n_genes_by_counts log1p_total_counts pct_counts_in_top_10_genes pct_counts_in_top_20_genes pct_counts_in_top_50_genes pct_counts_in_top_150_genes
1 1 377.663005 843.541888 154 0 0 154.0 110.361875 45.562656 62 4.143135 5.043425 55.844156 71.428571 92.207792 100.0
2 2 382.078658 858.944818 64 0 0 64.0 87.919219 24.248906 40 3.713572 4.174387 48.437500 68.750000 100.000000 100.0
3 3 319.839529 869.196542 57 0 0 57.0 52.561875 23.526406 41 3.737670 4.060443 43.859649 63.157895 100.000000 100.0
4 4 259.304707 851.797949 120 0 0 120.0 75.230312 35.176719 50 3.931826 4.795791 49.166667 71.666667 100.000000 100.0
5 5 370.576291 865.193024 120 0 0 120.0 180.218594 34.499375 65 4.189655 4.795791 42.500000 62.500000 87.500000 100.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
167778 167778 7455.404785 5115.021094 238 1 0 238.0 219.956094 61.412500 77 4.356709 5.476463 43.277311 63.445378 88.655462 100.0
167779 167779 7483.771045 5111.720703 80 0 0 80.0 38.427969 25.964844 36 3.610918 4.394449 55.000000 80.000000 100.000000 100.0
167780 167780 7470.119580 5119.350366 406 0 0 406.0 287.690469 86.158125 75 4.330733 6.008813 55.911330 73.152709 93.842365 100.0
167781 167781 7477.704004 5128.963086 120 0 0 120.0 235.670469 25.016563 52 3.970292 4.795791 47.500000 70.000000 98.333333 100.0
167782 167782 7489.376562 5123.402393 393 0 0 393.0 269.447344 111.445625 78 4.369448 5.976351 45.292621 69.465649 92.875318 100.0

167782 rows × 16 columns

Calculate quality control metrics

Calculate the quality control metrics on the anndata.AnnData using scanpy.pp.calculate_qc_metrics.

sc.pp.calculate_qc_metrics(adata, percent_top=(10, 20, 50, 150), inplace=True)

The percentage of control probes and control codewords can be calculated from adata.obs

cprobes = (
    adata.obs["control_probe_counts"].sum() / adata.obs["total_counts"].sum() * 100
)
cwords = (
    adata.obs["control_codeword_counts"].sum() / adata.obs["total_counts"].sum() * 100
)
print(f"Negative DNA probe count % : {cprobes}")
print(f"Negative decoding count % : {cwords}")
Negative DNA probe count % : 0.08700852649698224
Negative decoding count % : 0.005482476054877954

Next we plot the distribution of total transcripts per cell, unique transcripts per cell, area of segmented cells and the ratio of nuclei area to their cells

fig, axs = plt.subplots(1, 4, figsize=(15, 4))

axs[0].set_title("Total transcripts per cell")
sns.histplot(
    adata.obs["total_counts"],
    kde=False,
    ax=axs[0],
)

axs[1].set_title("Unique transcripts per cell")
sns.histplot(
    adata.obs["n_genes_by_counts"],
    kde=False,
    ax=axs[1],
)


axs[2].set_title("Area of segmented cells")
sns.histplot(
    adata.obs["cell_area"],
    kde=False,
    ax=axs[2],
)

axs[3].set_title("Nucleus ratio")
sns.histplot(
    adata.obs["nucleus_area"] / adata.obs["cell_area"],
    kde=False,
    ax=axs[3],
)
<AxesSubplot: title={'center': 'Nucleus ratio'}, ylabel='Count'>
../../_images/706e9cfeb39eb7637c7bde51668a8f262d241a0bf8ab3c5c25409153927fd36a.png

Filter the cells based on the minimum number of counts required using scanpy.pp.filter_cells. Filter the genes based on the minimum number of cells required with scanpy.pp.filter_genes. The parameters for the both were specified based on the plots above. They were set to filter out the cells and genes with minimum counts and minimum cells respectively.

Other filter criteria might be cell area, DAPI signal or a minimum of unique transcripts.

sc.pp.filter_cells(adata, min_counts=10)
sc.pp.filter_genes(adata, min_cells=5)

Normalize counts per cell using scanpy.pp.normalize_total.

Logarithmize, do principal component analysis, compute a neighborhood graph of the observations using scanpy.pp.log1p, scanpy.pp.pca and scanpy.pp.neighbors respectively.

Use scanpy.tl.umap to embed the neighborhood graph of the data and cluster the cells into subgroups employing scanpy.tl.leiden.

adata.layers["counts"] = adata.X.copy()
sc.pp.normalize_total(adata, inplace=True)
sc.pp.log1p(adata)
sc.pp.pca(adata)
sc.pp.neighbors(adata)
sc.tl.umap(adata)
sc.tl.leiden(adata)
c:\Users\laure\AppData\Local\Programs\Python\Python310\lib\site-packages\tqdm\auto.py:22: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
  from .autonotebook import tqdm as notebook_tqdm

Visualize annotation on UMAP and spatial coordinates

Subplot with scatter plot in UMAP (Uniform Manifold Approximation and Projection) basis. The embedded points were colored, respectively, according to the total counts, number of genes by counts, and leiden clusters in each of the subplots. This gives us some idea of what the data looks like.

sc.pl.umap(
    adata,
    color=[
        "total_counts",
        "n_genes_by_counts",
        "leiden",
    ],
    wspace=0.4,
)
../../_images/1249d9f0d97053e262fd9568034c76ea57a5b7579e590cb215c04dc2a8c498d6.png
sq.pl.spatial_scatter(
    adata,
    library_id="spatial",
    shape=None,
    color=[
        "leiden",
    ],
    wspace=0.4,
)
../../_images/9f83a7b44067cdf1f36877c6acdc432f15412f29c43af23ab52da155f79a6f43.png

Computation of spatial statistics

Building the spatial neighbors graphs

This example shows how to compute centrality scores, given a spatial graph and cell type annotation.

The scores calculated are closeness centrality, degree centrality and clustering coefficient with the following properties:

  • closeness centrality - measure of how close the group is to other nodes.

  • clustering coefficient - measure of the degree to which nodes cluster together.

  • degree centrality - fraction of non-group members connected to group members.

All scores are descriptive statistics of the spatial graph.

This dataset contains Leiden cluster groups’ annotations in anndata.AnnData.obs, which are used for calculation of centrality scores.

First, we need to compute a connectivity matrix from spatial coordinates to calculate the centrality scores. We can use squidpy.gr.spatial_neighbors for this purpose. We use the coord_type="generic" based on the data and the neighbors are classified with Delaunay triangulation by specifying delaunay=True.

sq.gr.spatial_neighbors(adata, coord_type="generic", delaunay=True)

Compute centrality scores

Centrality scores are calculated with squidpy.gr.centrality_scores, with the Leiden groups as clusters.

sq.gr.centrality_scores(adata, cluster_key="leiden")

The results were visualized by plotting the average centrality, closeness centrality, and degree centrality using squidpy.pl.centrality_scores.

sq.pl.centrality_scores(adata, cluster_key="leiden", figsize=(16, 5))
../../_images/5ab35e6ad84c7df586932a3e2341f9874b65f10ad90d7c13ccb24e41fcbeebff.png

Compute co-occurrence probability

This example shows how to compute the co-occurrence probability.

The co-occurrence score is defined as:

(1)\[\begin{equation} \frac{p(exp|cond)}{p(exp)} \end{equation}\]

where \(p(exp|cond)\) is the conditional probability of observing a cluster \(exp\) conditioned on the presence of a cluster \(cond\), whereas \(p(exp)\) is the probability of observing \(exp\) in the radius size of interest. The score is computed across increasing radii size around each cell in the tissue.

We can compute the co-occurrence score with squidpy.gr.co_occurrence. Results of co-occurrence probability ratio can be visualized with squidpy.pl.co_occurrence. The ‘3’ in the \(\frac{p(exp|cond)}{p(exp)}\) represents a Leiden clustered group.

We can further visualize tissue organization in spatial coordinates with squidpy.pl.spatial_scatter, with an overlay of the expressed genes which were colored in consonance with the Leiden clusters.

adata_subsample = sc.pp.subsample(adata, fraction=0.5, copy=True)
sq.gr.co_occurrence(
    adata_subsample,
    cluster_key="leiden",
)
sq.pl.co_occurrence(
    adata_subsample,
    cluster_key="leiden",
    clusters="12",
    figsize=(10, 10),
)
sq.pl.spatial_scatter(
    adata_subsample,
    color="leiden",
    shape=None,
    size=2,
)
WARNING: `n_splits` was automatically set to `41` to prevent `82039x82039` distance matrix from being created
100%|██████████| 861/861 [15:08<00:00,  1.06s/]
WARNING: Please specify a valid `library_id` or set it permanently in `adata.uns['spatial']`
../../_images/1696ff49308c7e4f98a76c6bf552aa1f670722a236b0b71d07cddfbf1983c7fb.png ../../_images/0980ee757a69fa228629498970a7ee55dcdad23bcb4adaa6b169c4eed7198b86.png

Neighbors enrichment analysis

This example shows how to run the neighbors enrichment analysis routine.

It calculates an enrichment score based on proximity on the connectivity graph of cell clusters. The number of observed events is compared against \(N\) permutations and a z-score is computed.

This dataset contains cell type annotations in anndata.Anndata.obs which are used for calculation of the neighborhood enrichment. We calculate the neighborhood enrichment score with squidpy.gr.nhood_enrichment.

sq.gr.nhood_enrichment(adata, cluster_key="leiden")
100%|██████████| 1000/1000 [00:25<00:00, 38.50/s]

And visualize the results with squidpy.pl.nhood_enrichment.

fig, ax = plt.subplots(1, 2, figsize=(13, 7))
sq.pl.nhood_enrichment(
    adata,
    cluster_key="leiden",
    figsize=(8, 8),
    title="Neighborhood enrichment adata",
    ax=ax[0],
)
sq.pl.spatial_scatter(adata_subsample, color="leiden", shape=None, size=2, ax=ax[1])
WARNING: Please specify a valid `library_id` or set it permanently in `adata.uns['spatial']`
../../_images/fcf7f046f31b5c377379c3031d8c943a625efd30900b813568d20154fbb3a6e5.png

Compute Moran’s I score

This example shows how to compute the Moran’s I global spatial auto-correlation statistics.

The Moran’s I global spatial auto-correlation statistics evaluates whether features (i.e. genes) shows a pattern that is clustered, dispersed or random in the tissue are under consideration.

We can compute the Moran’s I score with squidpy.gr.spatial_autocorr and mode = 'moran'. We first need to compute a spatial graph with squidpy.gr.spatial_neighbors. We will also subset the number of genes to evaluate.

sq.gr.spatial_neighbors(adata_subsample, coord_type="generic", delaunay=True)
sq.gr.spatial_autocorr(
    adata_subsample,
    mode="moran",
    n_perms=100,
    n_jobs=1,
)
adata_subsample.uns["moranI"].head(10)
100%|██████████| 100/100 [34:39<00:00, 20.80s/]
I pval_norm var_norm pval_z_sim pval_sim var_sim pval_norm_fdr_bh pval_z_sim_fdr_bh pval_sim_fdr_bh
KRT7 0.696668 0.0 0.000004 0.0 0.009901 0.000008 0.0 0.0 0.009965
SCD 0.671975 0.0 0.000004 0.0 0.009901 0.000009 0.0 0.0 0.009965
FOXA1 0.659014 0.0 0.000004 0.0 0.009901 0.000008 0.0 0.0 0.009965
FASN 0.653400 0.0 0.000004 0.0 0.009901 0.000006 0.0 0.0 0.009965
EPCAM 0.641954 0.0 0.000004 0.0 0.009901 0.000009 0.0 0.0 0.009965
TACSTD2 0.638978 0.0 0.000004 0.0 0.009901 0.000007 0.0 0.0 0.009965
CEACAM6 0.631554 0.0 0.000004 0.0 0.009901 0.000009 0.0 0.0 0.009965
ERBB2 0.628083 0.0 0.000004 0.0 0.009901 0.000009 0.0 0.0 0.009965
KRT8 0.619555 0.0 0.000004 0.0 0.009901 0.000008 0.0 0.0 0.009965
LUM 0.593700 0.0 0.000004 0.0 0.009901 0.000007 0.0 0.0 0.009965

We can visualize some of those genes with squidpy.pl.spatial_scatter. We could also pass mode = 'geary' to compute a closely related auto-correlation statistic, Geary’s C. See squidpy.gr.spatial_autocorr for more information.

sq.pl.spatial_scatter(
    adata_subsample,
    library_id="spatial",
    color=[
        "KRT7",
        "FOXA1",
    ],
    shape=None,
    size=2,
    img=False,
)
../../_images/0a32d954e10911fb9eef2ffea259f99e0502b7029ae39409cf206888b4c7fa1d.png