Analyze Xenium data

import spatialdata as sd
from spatialdata_io import xenium

import matplotlib.pyplot as plt
import seaborn as sns

import scanpy as sc
import squidpy as sq

A reader for Xenium data is available in spatialdata-io. We use it to parse and convert to Zarr a Xenium dataset of Human Lung Cancer.

After downloading and extracting the dataset into a directory named Xenium, we specify the path to the dataset and where we want to store our Zarr files.

xenium_path = "./Xenium"
zarr_path = "./Xenium.zarr"

load the data into a spatialdata object using the xenium reader.

sdata = xenium(xenium_path)

Convert to Zarr.

xenium.write(zarr_path)

From now on we only read directly from the zarr store.

sdata = sd.read_zarr(zarr_path)
sdata
SpatialData object with:
├── Images
│     └── 'morphology_focus': MultiscaleSpatialImage[cyx] (1, 20535, 51103), (1, 10267, 25551), (1, 5133, 12775), (1, 2566, 6387), (1, 1283, 3193)
├── Labels
│     ├── 'cell_labels': MultiscaleSpatialImage[yx] (20535, 51103), (10267, 25551), (5133, 12775), (2566, 6387), (1283, 3193)
│     └── 'nucleus_labels': MultiscaleSpatialImage[yx] (20535, 51103), (10267, 25551), (5133, 12775), (2566, 6387), (1283, 3193)
├── Points
│     └── 'transcripts': DataFrame with shape: (40257199, 11) (3D points)
├── Shapes
│     ├── 'cell_boundaries': GeoDataFrame shape: (161000, 1) (2D shapes)
│     ├── 'cell_circles': GeoDataFrame shape: (161000, 2) (2D shapes)
│     └── 'nucleus_boundaries': GeoDataFrame shape: (161000, 1) (2D shapes)
└── Tables
      └── 'table': AnnData (161000, 480)
with coordinate systems:
▸ 'global', with elements:
        morphology_focus (Images), cell_labels (Labels), nucleus_labels (Labels), transcripts (Points), cell_boundaries (Shapes), cell_circles (Shapes), nucleus_boundaries (Shapes)

For the analysis we use the anndata.AnnData object, which contains the count matrix, cell and gene annotations. It is stored in the spatialdata.tables layer.

adata = sdata.tables["table"]
adata
AnnData object with n_obs × n_vars = 161000 × 480
    obs: 'cell_id', 'transcript_counts', 'control_probe_counts', 'control_codeword_counts', 'unassigned_codeword_counts', 'deprecated_codeword_counts', 'total_counts', 'cell_area', 'nucleus_area', 'region', 'z_level', 'nucleus_count', 'cell_labels'
    var: 'gene_ids', 'feature_types', 'genome'
    uns: 'spatialdata_attrs'
    obsm: 'spatial'
adata.obs
cell_id transcript_counts control_probe_counts control_codeword_counts unassigned_codeword_counts deprecated_codeword_counts total_counts cell_area nucleus_area region z_level nucleus_count cell_labels
0 aaaakjjo-1 74 0 0 0 0 74 147.254537 29.622501 cell_circles 4.0 1.0 1
1 aaaaoeme-1 55 0 0 0 0 55 118.309379 6.231563 cell_circles 5.0 1.0 2
2 aaaaofcm-1 188 0 0 0 0 188 203.609539 35.583126 cell_circles 5.0 1.0 3
3 aaabeahl-1 125 0 0 0 0 125 120.205942 21.087970 cell_circles 5.0 1.0 4
4 aaabkjgc-1 56 0 0 0 0 56 118.038442 13.230782 cell_circles 4.0 1.0 5
... ... ... ... ... ... ... ... ... ... ... ... ... ...
160995 oimkckeg-1 55 0 0 0 0 55 54.187502 20.591251 cell_circles 7.0 1.0 160996
160996 oimkjcem-1 134 0 0 0 0 134 104.762504 49.446096 cell_circles 7.0 1.0 160997
160997 oimklafm-1 84 0 0 0 0 84 109.187816 21.042813 cell_circles 7.0 1.0 160998
160998 oimkmnpi-1 61 0 0 0 0 61 45.923908 17.836719 cell_circles 7.0 1.0 160999
160999 oimknppn-1 91 0 0 0 0 91 114.425942 33.099532 cell_circles 7.0 1.0 161000

161000 rows × 13 columns

Squidpy looks for cell coordinates in .obsm["spatial"]. By using the Xenium reader from spatialdata-io this is already automatically set. For more complex data this needs to be set manually, more details can be found in this spatialdata tutorial on squidpy integration.

adata.obsm["spatial"]
array([[ 111.34860229,   28.03543472],
       [ 120.43323517,   75.61804199],
       [ 126.72792816,   56.90501404],
       ...,
       [7191.46337891,  398.10839844],
       [7185.16210938,  400.09918213],
       [7179.02294922,  394.43218994]])

Calculate quality control metrics

Calculate the quality control metrics on the anndata.AnnData using scanpy.pp.calculate_qc_metrics.

sc.pp.calculate_qc_metrics(adata, percent_top=(10, 20, 50, 150), inplace=True)

The percentage of control probes and control codewords can be calculated from adata.obs

cprobes = (
    adata.obs["control_probe_counts"].sum() / adata.obs["total_counts"].sum() * 100
)
cwords = (
    adata.obs["control_codeword_counts"].sum() / adata.obs["total_counts"].sum() * 100
)
print(f"Negative DNA probe count % : {cprobes}")
print(f"Negative decoding count % : {cwords}")
Negative DNA probe count % : 0.005055624161283524
Negative decoding count % : 0.0025133467754592624

Next we plot the distribution of total transcripts per cell, unique transcripts per cell, area of segmented cells and the ratio of nuclei area to their cells

fig, axs = plt.subplots(1, 4, figsize=(15, 4))

axs[0].set_title("Total transcripts per cell")
sns.histplot(
    adata.obs["total_counts"],
    kde=False,
    ax=axs[0],
)

axs[1].set_title("Unique transcripts per cell")
sns.histplot(
    adata.obs["n_genes_by_counts"],
    kde=False,
    ax=axs[1],
)


axs[2].set_title("Area of segmented cells")
sns.histplot(
    adata.obs["cell_area"],
    kde=False,
    ax=axs[2],
)

axs[3].set_title("Nucleus ratio")
sns.histplot(
    adata.obs["nucleus_area"] / adata.obs["cell_area"],
    kde=False,
    ax=axs[3],
)
<Axes: title={'center': 'Nucleus ratio'}, ylabel='Count'>
../../_images/317ade47d74f0e55eb59185c4af5520a92dfff73e36ab12c7756038e268f49dc.png

Filter the cells based on the minimum number of counts required using scanpy.pp.filter_cells. Filter the genes based on the minimum number of cells required with scanpy.pp.filter_genes. The parameters for the both were specified based on the plots above. They were set to filter out the cells and genes with minimum counts and minimum cells respectively.

Other filter criteria might be cell area, DAPI signal or a minimum of unique transcripts.

sc.pp.filter_cells(adata, min_counts=10)
sc.pp.filter_genes(adata, min_cells=5)

Normalize counts per cell using scanpy.pp.normalize_total.

Logarithmize, do principal component analysis, compute a neighborhood graph of the observations using scanpy.pp.log1p, scanpy.pp.pca and scanpy.pp.neighbors respectively.

Use scanpy.tl.umap to embed the neighborhood graph of the data and cluster the cells into subgroups employing scanpy.tl.leiden.

adata.layers["counts"] = adata.X.copy()
sc.pp.normalize_total(adata, inplace=True)
sc.pp.log1p(adata)
sc.pp.pca(adata)
sc.pp.neighbors(adata)
sc.tl.umap(adata)
sc.tl.leiden(adata)

Visualize annotation on UMAP and spatial coordinates

Subplot with scatter plot in UMAP (Uniform Manifold Approximation and Projection) basis. The embedded points were colored, respectively, according to the total counts, number of genes by counts, and leiden clusters in each of the subplots. This gives us some idea of what the data looks like.

sc.pl.umap(
    adata,
    color=[
        "total_counts",
        "n_genes_by_counts",
        "leiden",
    ],
    wspace=0.4,
)
../../_images/0b7ddee5731940f15ae76afb2be1423b6cd7dba115893e93fa590dceaabc99d1.png
sq.pl.spatial_scatter(
    adata,
    library_id="spatial",
    shape=None,
    color=[
        "leiden",
    ],
    wspace=0.4,
)
../../_images/5cbaf5dc68aec6fe094167ebac484948f88e2c1a7030d1dfc6efa5bcb10820d8.png

Computation of spatial statistics

This example shows how to compute centrality scores, given a spatial graph and cell type annotation.

The scores calculated are closeness centrality, degree centrality and clustering coefficient with the following properties:

  • closeness centrality - measure of how close the group is to other nodes.

  • clustering coefficient - measure of the degree to which nodes cluster together.

  • degree centrality - fraction of non-group members connected to group members.

All scores are descriptive statistics of the spatial graph.

This dataset contains Leiden cluster groups’ annotations in adata.obs, which are used for calculation of centrality scores.

Building a spatial neighborhood graph

First, we need to compute a connectivity matrix from spatial coordinates to calculate the centrality scores. We can use squidpy.gr.spatial_neighbors for this purpose. We use the coord_type="generic" based on the data and the neighbors are classified with Delaunay triangulation by specifying delaunay=True.

sq.gr.spatial_neighbors(adata, coord_type="generic", delaunay=True)

Compute centrality scores

Centrality scores are calculated with squidpy.gr.centrality_scores, with the Leiden groups as clusters.

sq.gr.centrality_scores(adata, cluster_key="leiden")

The results were visualized by plotting the average centrality, closeness centrality, and degree centrality using squidpy.pl.centrality_scores.

sq.pl.centrality_scores(adata, cluster_key="leiden", figsize=(16, 5))
../../_images/335b5665de12ce214c0abf0484e24dbc2ffc60c5f7c08c0576119891fbed4689.png

Compute co-occurrence probability

This example shows how to compute the co-occurrence probability.

The co-occurrence score is defined as:

\(\frac{p(exp|cond)}{p(exp)}\)

where \(p(exp|cond)\) is the conditional probability of observing a cluster \(exp\) conditioned on the presence of a cluster \(cond\), whereas \(p(exp)\) is the probability of observing \(exp\) in the radius size of interest. The score is computed across increasing radii size around each cell in the tissue.

We can compute the co-occurrence score with squidpy.gr.co_occurrence. Results of co-occurrence probability ratio can be visualized with squidpy.pl.co_occurrence. The ‘3’ in the \(\frac{p(exp|cond)}{p(exp)}\) represents a Leiden clustered group.

We can further visualize tissue organization in spatial coordinates with squidpy.pl.spatial_scatter, with an overlay of the expressed genes which were colored in consonance with the Leiden clusters.

First we create a new table, which is a subset of the original Anndata object, and store it in tables.

sdata.tables["subsample"] = sc.pp.subsample(adata, fraction=0.5, copy=True)

You can also work with sdata.tables["subsample"] directly, but to keep verbosity low, we just use the anndata.Anndata object as before.

adata_subsample = sdata.tables["subsample"]
sq.gr.co_occurrence(
    adata_subsample,
    cluster_key="leiden",
)
sq.pl.co_occurrence(
    adata_subsample,
    cluster_key="leiden",
    clusters="12",
    figsize=(10, 10),
)
sq.pl.spatial_scatter(
    adata_subsample,
    color="leiden",
    shape=None,
    size=2,
)
WARNING: `n_splits` was automatically set to `39` to prevent `79670x79670` distance matrix from being created
  0%|          | 0/780 [00:00<?, ?/s]
100%|██████████| 780/780 [17:39<00:00,  1.36s/]
WARNING: Please specify a valid `library_id` or set it permanently in `adata.uns['spatial']`
../../_images/1fe9afb04453a5f2f7fe0c8d35aebba9abe37903ebd951e0a7c81c062fb4b9f6.png ../../_images/e740c4c101068f9d104ba3b9ae96626142fe6d6790b76dd54515420bf29a5cb1.png

Neighbors enrichment analysis

This example shows how to run the neighbors enrichment analysis routine.

It calculates an enrichment score based on proximity on the connectivity graph of cell clusters. The number of observed events is compared against permutations and a z-score is computed.

This dataset contains cell type annotations in adata.obs which are used for calculation of the neighborhood enrichment. We calculate the neighborhood enrichment score with squidpy.gr.nhood_enrichment.

sq.gr.nhood_enrichment(adata, cluster_key="leiden")
100%|██████████| 1000/1000 [00:21<00:00, 46.11/s]

And visualize the results with squidpy.pl.nhood_enrichment.

fig, ax = plt.subplots(1, 2, figsize=(13, 7))
sq.pl.nhood_enrichment(
    adata,
    cluster_key="leiden",
    figsize=(8, 8),
    title="Neighborhood enrichment adata",
    ax=ax[0],
)
sq.pl.spatial_scatter(adata_subsample, color="leiden", shape=None, size=2, ax=ax[1])
WARNING: Please specify a valid `library_id` or set it permanently in `adata.uns['spatial']`
../../_images/841fbfc4572bbec16d020c4a35d4527074464c0e147ae70bc6ff56bf687b3cb4.png

Compute Moran’s I score

This example shows how to compute the Moran’s I global spatial auto-correlation statistics.

The Moran’s I global spatial auto-correlation statistics evaluates whether features (i.e. genes) shows a pattern that is clustered, dispersed or random in the tissue are under consideration.

We can compute the Moran’s I score with squidpy.gr.spatial_autocorr and mode = ‘moran’. We first need to compute a spatial graph with squidpy.gr.spatial_neighbors. We will also subset the number of genes to evaluate.

sq.gr.spatial_neighbors(adata_subsample, coord_type="generic", delaunay=True)
sq.gr.spatial_autocorr(
    adata_subsample,
    mode="moran",
    n_perms=100,
    n_jobs=1,
)
adata_subsample.uns["moranI"].head(10)
100%|██████████| 100/100 [05:49<00:00,  3.49s/]
I pval_norm var_norm pval_z_sim pval_sim var_sim pval_norm_fdr_bh pval_z_sim_fdr_bh pval_sim_fdr_bh
AREG 0.695981 0.0 0.000004 0.0 0.009901 0.000009 0.0 0.0 0.009984
MET 0.683335 0.0 0.000004 0.0 0.009901 0.000010 0.0 0.0 0.009984
ANXA1 0.666874 0.0 0.000004 0.0 0.009901 0.000009 0.0 0.0 0.009984
EPCAM 0.633024 0.0 0.000004 0.0 0.009901 0.000007 0.0 0.0 0.009984
DMBT1 0.587696 0.0 0.000004 0.0 0.009901 0.000009 0.0 0.0 0.009984
IGKC 0.549282 0.0 0.000004 0.0 0.009901 0.000006 0.0 0.0 0.009984
IGHG1 0.518234 0.0 0.000004 0.0 0.009901 0.000006 0.0 0.0 0.009984
IDO1 0.496409 0.0 0.000004 0.0 0.009901 0.000006 0.0 0.0 0.009984
SPARC 0.445792 0.0 0.000004 0.0 0.009901 0.000005 0.0 0.0 0.009984
APOE 0.437499 0.0 0.000004 0.0 0.009901 0.000008 0.0 0.0 0.009984

We can visualize some of those genes with squidpy.pl.spatial_scatter. We could also pass mode = 'geary' to compute a closely related auto-correlation statistic, Geary’s C. See squidpy.gr.spatial_autocorr for more information.

sq.pl.spatial_scatter(
    adata_subsample,
    library_id="spatial",
    color=[
        "AREG",
        "MET",
    ],
    shape=None,
    size=2,
    img=False,
)
../../_images/5d153630a351b9ccd355b799fa717f63201831984bce3e25321b176c8230ba94.png

For plotting we can also use spatialdata-plot, a package for static plotting of spatialdataobjects.

import spatialdata_plot

gene_name = ["AREG", "MET"]
for name in gene_name:
    sdata.pl.render_images("morphology_focus").pl.render_shapes(
        "cell_circles",
        color=name,
        table_name="table",
        use_raw=False,
    ).pl.show(
        title=f"{name} expression over Morphology image",
        coordinate_systems="global",
        figsize=(10, 5),
    )
../../_images/6c56f0e0ec93abe53e2b0ff550850119bb5d60178776a2a436f4197b6417a7c6.png ../../_images/0ed3c935655eb109a85a5a1881bbc48bd80dc4c3da4062d95d4f64c26f19b39c.png

Interactive analysis and visualization with napari-spatialdata

Additionally, one can use napari-spatialdata to visualize Xenium data with an interactive GUI.

from napari_spatialdata import Interactive

Interactive(sdata)

Here we visualize AREG expression across all cells:

title

For a detailed view on how spatialdata can be used to analyze Xenium data, please refer to the corresponding SpatialData tutorial.