Vizgen Mouse Liver Squidpy Vignette

This vignette shows how to use Squidpy and Scanpy to analyze MERFISH data from the Vizgen MERFISH Mouse Liver Map. This notebook analyzes the Liver1Slice1 MERFISH dataset that measures 347 genes across over >300,000 liver cells in a single mouse liver slice.

from copy import deepcopy

import numpy as np
import pandas as pd
from scipy.cluster import hierarchy as sch

from matplotlib import pyplot as plt

import scanpy as sc
import squidpy as sq

%matplotlib inline

Single-Cell Clustering of Vizgen MERFISH Mouse Liver Data

Obtain Data from Vizgen

We will use the Liver1Slice1 dataset from Vizgen’s MERFISH Mouse Liver Map: In order to run this tutorial we will download the cell_by_gene.csv and meta_cell.csv. Please follow the instructions to obtain access to the showcase data and download the data - here we save the data to a directory called tutorial_data/ in the same directory as this notebook.

# confirm the contents of the directory tutorial_data
!ls tutorial_data/
Liver1Slice1_cell_by_gene.csv  Liver1Slice1_cell_metadata.csv

Load MERFISH Data into AnnData Object

First, we load the cell-by-gene and cell-metadata CSVs and construct an AnnData object adata.

adata =
AnnData object with n_obs × n_vars = 395215 × 385
    obs: 'fov', 'volume', 'min_x', 'max_x', 'min_y', 'max_y'
    uns: 'spatial'
    obsm: 'spatial'

Make gene names unique and calculate QC metrics

adata.var["mt"] = adata.var_names.str.startswith("mt-")
    adata, qc_vars=["mt"], percent_top=(50, 100, 200, 300), inplace=True
OMP: Info #270: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.

Filter cells with low expression and genes that are expressed in too few cells.

sc.pp.filter_cells(adata, min_counts=50)
sc.pp.filter_genes(adata, min_cells=10)

Data Pre-processing

Here we use Scanpy total-count normalize, logarithmize, and scale gene expression to unit variance (clipping values that exceed 10 standard deviations).

print("normalize total")
print("log transform")
sc.pp.scale(adata, max_value=10)
normalize total
log transform

Dimensionality Reduction, Neighbor Calculation, and Clustering

Here we use Scanpy to: reduce the dimensionality of our data by running principal component analysis (PCA), calculate the neighborhood graph of cells in PCA space, display cells in a two-dimensional UMAP embedding, and finally identify clusters of cells using Leiden clustering.

resolution = 1.5
print("PCA"), svd_solver="arpack")
sc.pp.neighbors(adata, n_neighbors=10, n_pcs=20)
print("Leiden"), resolution=resolution)

UMAP with Leiden Clustering Labels

Here we visualize the distributions of the Leiden clusters in the UMAP plot. We see Leiden clusters tend to segregate into distinct regions within the UMAP plot.

sc.set_figure_params(figsize=(10, 10)), color=["leiden"], size=5)

Spatial Distributions of Cells

Here we visualize the spatial locations of cells in the mouse liver colored by Leiden cluster. We observe distinct spatial localizations of Leiden clusters throughout the tissue. We see some clusters line blood vessels and while others form concentric patterns reflecting hepatic zonation (Cunningham et. al. 2021). Our next step is to assign tentative cell type to our Leiden clusters and assess their spatial localizations in the liver.
    adata, shape=None, color="leiden", size=0.5, library_id="spatial", figsize=(10, 10)

Assign Cell Types

Reference Cell Type Marker Gene Sets

In order to tentatively assign liver cell types we utilize a gene-level cell type marker reference from the publication Spatial transcriptome profiling by MERFISH reveals fetal liver hematopoietic stem cell niche architecture. These marker genes will be used to assess cell type composition of the Leiden clusters. See MERFISH gene panel metadata:

gene_panel = ""
df_ref_panel_ini = pd.read_excel(gene_panel, index_col=0)
df_ref_panel = df_ref_panel_ini.iloc[1:, :1] = None
df_ref_panel.columns = ["Function"]

# Assign marker gene metadata using reference dataset
marker_genes = df_ref_panel[

meta_gene = deepcopy(adata.var)
common_marker_genes = list(set(meta_gene.index.tolist()).intersection(marker_genes))
meta_gene.loc[common_marker_genes, "Markers"] = df_ref_panel.loc[
    common_marker_genes, "Function"
meta_gene["Markers"] = meta_gene["Markers"].apply(
    lambda x: "N.A." if "marker" not in str(x) else x
N.A.                            323
HSC marker                        8
Hepatocyte marker                 5
Erythroid progenitor marker       5
Macrophage marker                 5
SEC marker                        5
Neutrophil marker                 5
MK marker                         5
Potential HSC marker              5
Pre-B cell marker                 5
Erythroid cell marker             4
EC marker                         4
AEC marker                        2
Myeloid marker                    2
Macrophage/Hepatocyte marker      1
Basophil marker                   1
Name: Markers, dtype: int64

Calculate Leiden Cluster Average Expression Signatures

Here we calculate the average gene expression signatures of the Leiden clusters, which will be used to assign cell type composition.

ser_counts = adata.obs["leiden"].value_counts() = "cell counts"
meta_leiden = pd.DataFrame(ser_counts)

cat_name = "leiden"
sig_leiden = pd.DataFrame(
    columns=adata.var_names, index=adata.obs[cat_name].cat.categories
for clust in adata.obs[cat_name].cat.categories:
    sig_leiden.loc[clust] = adata[adata.obs[cat_name].isin([clust]), :].X.mean(0)
sig_leiden = sig_leiden.transpose()
leiden_clusters = ["Leiden-" + str(x) for x in sig_leiden.columns.tolist()]
sig_leiden.columns = leiden_clusters
meta_leiden.index = sig_leiden.columns.tolist()
meta_leiden["leiden"] = pd.Series(
    meta_leiden.index.tolist(), index=meta_leiden.index.tolist()

Assign Cell Type Based on Top Expressed Marker Genes

Here we assign cell type composition to the Leiden clusters by counting the frequency of cell type marker genes in the top 30 most up-regulated genes for each cluster such that cell type is assigned based on the most frequently occuring marker genes. If there is a tie in the number of marker genes, we assign the cluster to more than one cell type. Using this approach eleven Leiden clusters are assigned to be Hepatocyte containing clusters.

meta_gene = pd.DataFrame(index=sig_leiden.index.tolist())
meta_gene["info"] = pd.Series("", index=meta_gene.index.tolist())
meta_gene["Markers"] = pd.Series("N.A.", index=sig_leiden.index.tolist())
meta_gene.loc[common_marker_genes, "Markers"] = df_ref_panel.loc[
    common_marker_genes, "Function"

meta_leiden["Cell_Type"] = pd.Series("N.A.", index=meta_leiden.index.tolist())
num_top_genes = 30
for inst_cluster in sig_leiden.columns.tolist():
    top_genes = (

    inst_ser = meta_gene.loc[top_genes, "Markers"]
    inst_ser = inst_ser[inst_ser != "N.A."]
    ser_counts = inst_ser.value_counts()

    max_count = ser_counts.max()

    max_cat = "_".join(sorted(ser_counts[ser_counts == max_count].index.tolist()))
    max_cat = max_cat.replace(" marker", "").replace(" ", "-")

    print(inst_cluster, max_cat)
    meta_leiden.loc[inst_cluster, "Cell_Type"] = max_cat

# rename clusters
meta_leiden["name"] = meta_leiden.apply(
    lambda x: x["Cell_Type"] + "_" + x["leiden"], axis=1
leiden_names = meta_leiden["name"].values.tolist()
meta_leiden.index = leiden_names

# transfer cell type labels to single cells
leiden_to_cell_type = deepcopy(meta_leiden)
leiden_to_cell_type.set_index("leiden", inplace=True) = None

adata.obs["Cell_Type"] = adata.obs["leiden"].apply(
    lambda x: leiden_to_cell_type.loc["Leiden-" + str(x), "Cell_Type"]
adata.obs["Cluster"] = adata.obs["leiden"].apply(
    lambda x: leiden_to_cell_type.loc["Leiden-" + str(x), "name"]
Leiden-0 Hepatocyte
Leiden-1 Hepatocyte
Leiden-2 Hepatocyte
Leiden-3 Hepatocyte_SEC
Leiden-4 Hepatocyte
Leiden-5 SEC
Leiden-6 Hepatocyte
Leiden-7 Hepatocyte
Leiden-8 Macrophage
Leiden-9 Hepatocyte
Leiden-10 Erythroid-cell_Erythroid-progenitor_Pre-B-cell
Leiden-11 AEC_Potential-HSC
Leiden-12 SEC
Leiden-13 Macrophage
Leiden-14 AEC_Hepatocyte
Leiden-15 SEC
Leiden-16 MK
Leiden-17 HSC
Leiden-18 Neutrophil
Leiden-19 HSC_Pre-B-cell
Leiden-20 Neutrophil
Leiden-21 HSC_Myeloid
Leiden-22 Hepatocyte_MK
Leiden-23 Hepatocyte_SEC
Leiden-24 SEC
Leiden-25 Hepatocyte
Leiden-26 Erythroid-cell_MK

Hepatocyte Zonation

Hepatocytes are the most abundant cell in the liver and have multiple roles in metabolism, endocrine production, protein synthesis, and detoxification. Hepatocytes form complex, radial structures called lobules that contain a central vein (with low blood oxygen level) surrounded by peripheral portal veins (with high blood oxygen level). Hepatocytes can also be broadly classified as peri-central or peri-portal based on their proximity to central and portal veins, respectively.

Central and Portal Blood Vessels

We can use the gene Vwf to identify endothelial cells (Horvath et. al. 2004) that line liver blood vessels. Plotting Vwf expression level in single cells (below) shows clear enrichment at blood vessel borders. Note that Vwf expression can also be used to distinguish artifactual holes/tears in the liver from blood vessels.

Next, we use the expression of Axin2 to mark peri-central regions (Sun et. al. 2020). Plotting Axin2 expression level in single cells shows Axin2 lining a subset of all Vwf positive blood vessels, which allows us to distinguish peri-portal (Axin2 negative) and peri-central (Axin2 positive) blood vessels. Similarly, we can use Axin2 expression to identify peri-central hepatocyte Leiden clusters (see next section).
    adata, color=["Vwf", "Axin2"], size=15, cmap="Reds", img=False, figsize=(12, 8)

Distinguishing Peri-Portal and Peri-Central Hepatocytes

As described above, we use the expression of Axin2 as a marker for peri-central hepatocytes (see Sun et. al.).

all_hepatocyte_clusters = [x for x in meta_leiden.index.tolist() if "Hepatocyte" in x]
sig_leiden.columns = meta_leiden.index.tolist()
ser_axin2 = sig_leiden[all_hepatocyte_clusters].loc["Axin2"]
peri_central = ser_axin2[ser_axin2 > 0].index.tolist()
peri_portal = ser_axin2[ser_axin2 <= 0].index.tolist()

Peri-Central Hepatocytes

Plotting peri-central and peri-portal hepatocytes separately displays their distinct interlocking morphologies with respect to blood vessels.
    adata, groups=peri_central, color="Cluster", size=15, img=False, figsize=(15, 15)

Peri-Portal Hepatocytes
    adata, groups=peri_portal, color="Cluster", size=15, img=False, figsize=(15, 15)

Neighborhood Enrichment

In this section we will use Squidpy to identify clusters that are spatially enriched for one another using a neighborhood enrichment test This test determines if cells belonging to two different clusters are close to each other more often than expected.

In order to run this test we first have to calculate a connectivity graph using the method. This graph consists of cells (nodes) and cell-cell interactions (edges).

We also visualize the neighborhood enrichment using a hierarchically clustered heatmap which shows clusters of enriched neighborhoods in our tissue., coord_type="generic", spatial_key="spatial"), cluster_key="leiden")
    figsize=(5, 5),
/Users/nickfernandez/anaconda3/envs/squidpy-env/lib/python3.8/site-packages/squidpy/pl/ MatplotlibDeprecationWarning: In a future version, 'pad' will default to rcParams['figure.subplot.hspace'].  Set pad=0 to keep the old behavior.
  col_ax = divider.append_axes("top", size="5%")

Neighborhood Enrichment Clusters

Here we visualize clusters from our neighborhood enrichment data obtained by hierarchically clustering the Z-scored neighborhood enrichment scores. We observe a cluster containing three peri-portal hepatocyte leiden clusters (Hepatocyte_Leiden-0, Hepatocyte_Leiden-4, and Hepatocyte_Leiden-6 ) and a large cluster containing several peri-central hepatocytes (Hepatocyte_Leiden-1, Hepatocyte_Leiden-2, Hepatocyte_Leiden-25). This demonstrates that we can recapitulate known spatial enrichment of peri-portal and peri-central hepatocytes using neighborhood enrichment.

n_clusters = [4]
df_nhood_enr = pd.DataFrame(
nhood_cluster_levels = ["Level-" + str(x) for x in n_clusters]
linkage = sch.linkage(df_nhood_enr, method="average")
mat_nhood_clusters = sch.cut_tree(linkage, n_clusters=n_clusters)
df_cluster = pd.DataFrame(
    mat_nhood_clusters, columns=nhood_cluster_levels, index=meta_leiden.index.tolist()

inst_level = "Level-" + str(n_clusters[0])
all_clusters = list(df_cluster[inst_level].unique())
# sc.set_figure_params(figsize=(10,10))
for inst_cluster in all_clusters:
    inst_clusters = df_cluster[df_cluster[inst_level] == inst_cluster].index.tolist()
        figsize=(15, 15),
../../_images/3d8bae6851b25e24a23e3674906c63ce5db4ecd42fe97387adff6dd1c42a3337.png ../../_images/d6e5bf6fd1682b576bf18a43b82fd3259729d058fc5e2afed14a4d6b13ee538d.png ../../_images/0e70be895867b37eb0658b8f0a6d3836c5ad470c61a1ac3f976b43befb229943.png ../../_images/fec0e0d93d7b1538499141da3980ba19b497e9167425903d2f2e48a5e1a51535.png

Network Centrality Scores

In addition to neighborhood enrichment we can also calculate network-based centrality scores for the Leiden clusters. These include

  • closeness centrality: how close a group is to other nodes

  • degree centrality: fraction of connected non-group members

  • clustering coeffeicient: measure of the degree to which nodes cluster, "leiden")
sc.set_figure_params(figsize=(20, 8))

# copy centrality data to new DataFrame
df_central = deepcopy(adata.uns["leiden_centrality_scores"])
df_central.index = meta_leiden.index.tolist()

# sort clusters based on centrality scores
# closeness centrality - measure of how close the group is to other nodes.
ser_closeness = df_central["closeness_centrality"].sort_values(ascending=False)

# degree centrality - fraction of non-group members connected to group members.
# [Networkx](
# The degree centrality for a node v is the fraction of nodes it is connected to.
ser_degree = df_central["degree_centrality"].sort_values(ascending=False)

# clustering coefficient - measure of the degree to which nodes cluster together.
ser_cluster = df_central["average_clustering"].sort_values(ascending=False)

High Closeness Score

Groups/clusters with high closeness are close to other groups and will tend to display a dispersed distribution throughout the tissue. Three of the top five clusters based on the closeness score are epithelial cells (SEC: sinusoidal epithelial cells, AEC: arterial epithelial cells) and one cluster consists of macrophages. We see that these groups are indeed evenly distributed across the tissue which results in them being relatively close to many other groups.

inst_clusters = ser_closeness.index.tolist()[:5]
    adata, groups=inst_clusters, color="Cluster", size=15, img=False, figsize=(10, 10)
['SEC_Leiden-5', 'Hepatocyte_SEC_Leiden-3', 'Macrophage_Leiden-8', 'AEC_Potential-HSC_Leiden-11', 'Macrophage_Leiden-13']

Low Closeness Score

Groups with low closeness are not close to other groups and tend to display an uneven and isolated distribution throughout the tissue. We see that clusters with low closeness scores tend to be located near blood vessels and consist of megakaryocyte, neutrophil, and erythroid cells. Their distinct localization and proximity to blood vessel (e.g., only neighboring cells on one side while the other side faces the blood vessel) contributes to their low level of interactions with other clusters.

inst_clusters = ser_closeness.index.tolist()[-5:]
    adata, groups=inst_clusters, color="Cluster", size=15, img=False, figsize=(10, 10)
['SEC_Leiden-24', 'Hepatocyte_Leiden-25', 'Hepatocyte_MK_Leiden-22', 'Neutrophil_Leiden-20', 'Erythroid-cell_MK_Leiden-26']

High Degree Centrality

Similarly to the results from the closeness scores we observed above, we see that the SEC and Macrophage clusters (Leiden-3, Leiden-5, and Leiden-8) have high degree centrality scores indicating that they have a high fraction of non-group member connections. We also note that these clusters tend to be more evenly distributed throughout the tissue.

inst_clusters = ser_degree.index.tolist()[:5]
    adata, groups=inst_clusters, color="Cluster", size=15, img=False, figsize=(10, 10)
['SEC_Leiden-5', 'Hepatocyte_SEC_Leiden-3', 'Hepatocyte_Leiden-1', 'Hepatocyte_Leiden-2', 'Macrophage_Leiden-8']

Low Degree Centrality

These cluster have particularly low non-group member connections and similarly to the results from the closeness score we see that these clusters tend to be near blood vessels. We also note that more of the clusters tend to be lower abundance clusters (Leiden cluster numbers are ranked by abundance).

inst_clusters = ser_degree.index.tolist()[-5:]
    adata, groups=inst_clusters, color="Cluster", size=15, img=False, figsize=(10, 10)
['Hepatocyte_SEC_Leiden-23', 'Neutrophil_Leiden-20', 'SEC_Leiden-24', 'Hepatocyte_Leiden-25', 'Erythroid-cell_MK_Leiden-26']

High Clustering Coefficient

The clustering coefficient indicates the degree to which nodes cluster together. We see that the top scoring clusters tend to be located near blood vessels. This distribution is somewhat counter-intuitive since one might expect that well segregated cells would form dense blobs rather than thin lines. However, their position along the edge of of blood vessels likely reduces the total number of neighbors each cell has since they are positioned at the border of a hole in the tissue.

inst_clusters = ser_cluster.index.tolist()[:5]
    adata, groups=inst_clusters, color="Cluster", size=15, img=False, figsize=(10, 10)
['MK_Leiden-16', 'HSC_Myeloid_Leiden-21', 'Neutrophil_Leiden-20', 'Erythroid-cell_MK_Leiden-26', 'Hepatocyte_SEC_Leiden-3']

Low Clustering Coefficient

Again, we see that non-isolated groups tend to be more evenly distributed throughout the tissue. Interestingly, these clusters mostly consist of hepatocytes.

inst_clusters = ser_cluster.index.tolist()[-5:]
    adata, groups=inst_clusters, color="Cluster", size=15, img=False, figsize=(15, 15)
['Erythroid-cell_Erythroid-progenitor_Pre-B-cell_Leiden-10', 'Hepatocyte_Leiden-2', 'Hepatocyte_SEC_Leiden-23', 'Hepatocyte_Leiden-0', 'Hepatocyte_Leiden-7']

Autocorrelation: Moran’s I Score

Our previous focus has been mainly on the distribution of cell clusters throughout the tissue. However we can also use Squidpy to investigate the spatial distributions of genes expressed in the tissue.

Here we use Squidpy to calculate the Moran’s I global spatial auto-correlation statistic, which can be used to identify genes that are non-randomly distributed in the tissue. We will visualize the top and bottom 20 scoring genes to highlight specific examples of genes with high and low auto-correlation., mode="moran")
num_view = 12
top_autocorr = (
bot_autocorr = (

Genes with high spatial autocorrelation

We note that many of the top scoring genes show expression patterns following the spatial pattern of hepatocyte Leiden clusters (e.g. Aldh1b1, Aldh3a20) and blood vessels (e.g. Dpt, Sfrp1). We can also check which cell types are most associated with these highly auto-correlated genes by ranking cell clusters based on their expression of these genes (e.g. mean expression). Below we see that three of the top five cell clusters are hepatocytes, which agrees with the gene’s localization patterns. These results indicate the primary spatial patterns in gene expression auto-correlation are being driven by their expression in hepatocytes and blood vessel associated cell types.
    adata, color=top_autocorr, size=20, cmap="Reds", img=False, figsize=(5, 5)
# top cell types based on average expression of top_autocorr genes

Genes with low autocorrelation

Genes with low auto-correlation show more evenly distributed expression patterns that do not follow hepatic zonation.
    adata, color=bot_autocorr, size=20, cmap="Reds", img=False, figsize=(5, 5)
# top cell types based on average expression of bot_autocorr genes


This tutorial shows how we can use Squidpy and Scanpy to explore the spatial distribution of single-cells and gene expression in mouse liver tissue from Vizgen’s MERFISH Mouse Liver Map. One of the main highlights is the intricate hepatic zonation patterns and their relationship to portal/central veins in the mouse liver.