Introduction to `tpSVG`
Boyi Guo
Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USASource:
vignettes/intro_to_tpSVG.Rmd
intro_to_tpSVG.Rmd
The objective of tpSVG
is to detect spatially variable
genes (SVG) when analyzing spatially-resolved transcriptomics data. This
includes both unsupervised features where there’s not additional
information is supplied besides the (normalized) gene counts and spatial
coordinates, but also the spatial variation explained besides some
covariates, such as tissue anatomy or possibly cell type composition.
Compared to previous SVG detection tools, tpSVG
provides
a scalable solution to model the gene expression as counts instead of
logarithm-transformed counts. While log transformation provides
convenience to model the spatial gene expression by mapping count data
to the continuous domain, hence enabling well-understood Gaussian
models, log transformation distorts low expressed genes counts and
create bias populating high-expressed genes. For example, the rank of
genes based on their effect size are commonly used for dimensional
reduction, or its input. Hence, estimating gene ranking correctly is
very important. Gaussian models, exemplified with nnSVG
,
often dissociates the mean-variance relationship which is commonly
assumed for counts data, and hence often prioritizes the highly
expressed genes over the lowly expressed genes. In the figure below, we
saw that nnSVG
is susceptible to such mean-rank
relationship, meaning highly expressed genes are often ranked highly. In
contrast, the proposed tpSVG
with Poisson distribution is
not susceptible to this mean-rank relationship when examining the DLPFC dataset.
Installation
GitHub
You can install the development version of tpSVG from GitHub with:
# install.packages("devtools")
devtools::install_github("boyiguo1/tpSVG")
Bioconductor (pending)
The package is currently submitted to Bioconductor for review.
Once the package is accepted by Bioconductor, you can install the latest
release version of tpSVG
from Bioconductor via the
following code. Additional details are shown on the Bioconductor
page.
if (!require("BiocManager", quietly = TRUE)) {
install.packages("BiocManager")
}
BiocManager::install("tpSVG")
The latest development version can also be installed from the
devel
version of Bioconductor or from GitHub following
BiocManager::install(version='devel')
Modeling spatially resolved gene expression using
tpSVG
In the following section, we demonstrate the how to use
tpSVG
with a SpatialExperiment
object.
Load Packages
To run the demonstration, there a couple of necessary packages to
load. We will use the data set in STexampleData
,
which contains a 10x Visium dataset. Specifically, we will find one
10xVisium sample collected from the dorso lateral prefrontal region of a
postmortem human brain from STexampleData
package, indexed by brain number of “151673” from Maynard,
Collado-Torres et al. (2021). For more information, please
see the vignettes of STexampleData
library(tpSVG)
library(SpatialExperiment)
library(STexampleData) # Example data
library(scuttle) # Data preprocess
Data processing
Before running the analysis using tpSVG
, we will
preprocess the data, which includes 1) calculating normalization factor,
or equivalently library size; 2) down-sizing the number of genes to 6 to
reduce running time. These preprocessing step may not be necessary if
real world data analysis.
spe <- Visium_humanDLPFC()
spe <- spe[, colData(spe)$in_tissue == 1]
spe <- logNormCounts(spe)
# Normalization factor
head(spe$sizeFactor)
#> AAACAAGTATCTCCCA-1 AAACAATCTACTAGCA-1 AAACACCAATAACTGC-1 AAACAGAGCGACTCCT-1
#> 1.8428941 0.3632188 0.8212187 1.1837838
#> AAACAGCTTTCAGAAG-1 AAACAGGGTCTATATT-1
#> 0.9321235 0.8724223
# Equivalently, library size
spe$total <- counts(spe) |> colSums()
# Down-sizing genes for faster computation
idx <- which(
rowData(spe)$gene_name %in% c("MOBP", "PCP4", "SNAP25",
"HBB", "IGKC", "NPY")
)
spe <- spe[idx, ]
Modeling raw counts with Poisson model
The following example demonstrates how to model raw gene counts data
with tpSVG
. The model fitting is simple, following
stats::glm
syntax to follow any distributional assumption
via the argument family
. The only key point to mention here
is the model needs to account for any technical variation due to the
gene profiling procedure. To account for such techincal variation, we
use offset
term in the model. In the following example, we
use the commonly used library size as the normalizing factor, and hence
set offset = log(spe$total)
to account for the techinical
variation in the data. Equivalently, it is also possible/encouraged to
use offset = log(spe$sizeFactor)
, where
spe$sizeFactor
is calculated during
logNormCounts
and is a linear function of the library size,
i.e. spe$total
. Note: it is very important to use the log
function in the offset
making sure the data scale is
conformable.
set.seed(1)
spe_poisson <- tpSVG(
spe,
family = poisson,
assay_name = "counts",
offset = log(spe$total) # Natural log library size
)
rowData(spe_poisson)
#> DataFrame with 6 rows and 6 columns
#> gene_id gene_name feature_type test_stat raw_p
#> <character> <character> <character> <numeric> <numeric>
#> ENSG00000211592 ENSG00000211592 IGKC Gene Expression 3447.12 0
#> ENSG00000168314 ENSG00000168314 MOBP Gene Expression 7143.34 0
#> ENSG00000122585 ENSG00000122585 NPY Gene Expression 3596.18 0
#> ENSG00000244734 ENSG00000244734 HBB Gene Expression 1536.56 0
#> ENSG00000132639 ENSG00000132639 SNAP25 Gene Expression 3529.46 0
#> ENSG00000183036 ENSG00000183036 PCP4 Gene Expression 1381.63 0
#> runtime
#> <numeric>
#> ENSG00000211592 0.408
#> ENSG00000168314 0.391
#> ENSG00000122585 0.388
#> ENSG00000244734 0.385
#> ENSG00000132639 0.405
#> ENSG00000183036 0.365
Gaussian model for log-transformed normalized data
tpSVG
provides flexibility regarding the distributional
assumption. If interested, it is possible to model log-transformed count
data using Gaussian distribution.
Covariate-adjusted Model
It is scientifically interesting to understand if and how much additional spatial variation in gene expression when accounting some known biology. For example, if known anatomy is accounted for in the model, is there any additional spatial variation in the gene expression which can be informative to any unknown biology. Statistically, this question is known as covariate adjustment, where the known biology is quantified and accounted for in a model.
To address this question, tpSVG
allows introducing
covariates in the model via the argument X
, where
X
takes a vector of any kind, including categorical
variables.
The frist step is to remove any missing data in the dataset,
specifically as the covariate. This can be done via
complete.cases
.
# Check missing data
idx_complete_case <- complete.cases(spe$ground_truth)
# If multiple covariates
# idx_complete_case <- complete.cases(spe$ground_truth, spe$cell_count)
# Remove missing data
spe <- spe[, idx_complete_case]
# Create a design matrix
x <- spe$ground_truth
spe_poisson_cov <- tpSVG(
spe,
X = x,
family = poisson,
assay_name = "counts",
offset = log(spe$total) # Natural log library size
)
image-based SRT in SpatialExperiment
(e.g. SpatialFeatureExperiment
)
tpSVG
can be also used to model image-based SRT data. We
use the seqFISH data from Lohoff and
Ghazanfar et al. (2020) to demonstrate tpSVG
.
Specifically, we use the curated example data in STexampleData
package. For more information, please see the vignettes of STexampleData
library(STexampleData)
spe <- seqFISH_mouseEmbryo()
#> see ?STexampleData and browseVignettes('STexampleData') for documentation
#> downloading 1 resources
#> retrieving 1 resource
#> loading from cache
#> Loading required package: BumpyMatrix
spe
#> class: SpatialExperiment
#> dim: 351 11026
#> metadata(0):
#> assays(2): counts molecules
#> rownames(351): Abcc4 Acp5 ... Zfp57 Zic3
#> rowData names(1): gene_name
#> colnames(11026): embryo1_Pos0_cell10_z2 embryo1_Pos0_cell100_z2 ...
#> embryo1_Pos28_cell97_z2 embryo1_Pos28_cell98_z2
#> colData names(14): cell_id embryo ... segmentation_vertices sample_id
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):
#> spatialCoords names(2) : x y
#> imgData names(0):
The example data set contains 351
genes for
11026
genes. To make the demonstration computationally
feasible, we down-size the number of genes to 1. The average computation
times for 11026 cells is roughly 2 minutes.
# Calculate "library size"
spe$total <- counts(spe) |> colSums()
# Down-size genes
idx_gene <- which(
rowData(spe)$gene_name %in%
c("Sox2")
)
library(tpSVG)
# Poisson model
tp_spe <- tpSVG(
input = spe[idx_gene,],
family = poisson(),
offset = log(spe$total),
assay_name = "counts")
rowData(tp_spe)
#> DataFrame with 1 row and 4 columns
#> gene_name test_stat raw_p runtime
#> <character> <numeric> <numeric> <numeric>
#> Sox2 Sox2 9588.51 0 1.326
Session Info
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#> setting value
#> version R Under development (unstable) (2024-02-26 r85990)
#> os Ubuntu 22.04.4 LTS
#> system x86_64, linux-gnu
#> ui X11
#> language en
#> collate C.UTF-8
#> ctype C.UTF-8
#> tz UTC
#> date 2024-02-28
#> pandoc 3.1.11 @ /opt/hostedtoolcache/pandoc/3.1.11/x64/ (via rmarkdown)
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#> package * version date (UTC) lib source
#> abind 1.4-5 2016-07-21 [1] RSPM
#> AnnotationDbi 1.65.2 2023-11-03 [1] Bioconductor
#> AnnotationHub * 3.11.1 2023-12-11 [1] Bioconduc~
#> beachmat 2.19.1 2024-01-22 [1] Bioconduc~
#> Biobase * 2.63.0 2023-10-24 [1] Bioconductor
#> BiocFileCache * 2.11.1 2023-10-26 [1] Bioconductor
#> BiocGenerics * 0.49.1 2023-11-01 [1] Bioconductor
#> BiocManager 1.30.22 2023-08-08 [1] RSPM
#> BiocParallel 1.37.0 2023-10-24 [1] Bioconductor
#> BiocStyle * 2.31.0 2023-10-24 [1] Bioconductor
#> BiocVersion 3.19.1 2023-10-26 [1] Bioconductor
#> Biostrings 2.71.2 2024-01-28 [1] Bioconduc~
#> bit 4.0.5 2022-11-15 [1] RSPM
#> bit64 4.0.5 2020-08-30 [1] RSPM
#> bitops 1.0-7 2021-04-24 [1] RSPM
#> blob 1.2.4 2023-03-17 [1] RSPM
#> bookdown 0.37 2023-12-01 [1] RSPM
#> bslib 0.6.1 2023-11-28 [1] RSPM
#> BumpyMatrix * 1.11.0 2023-10-24 [1] Bioconductor
#> cachem 1.0.8 2023-05-01 [1] RSPM
#> cli 3.6.2 2023-12-11 [1] RSPM
#> codetools 0.2-19 2023-02-01 [3] CRAN (R 4.4.0)
#> crayon 1.5.2 2022-09-29 [1] RSPM
#> curl 5.2.0 2023-12-08 [1] RSPM
#> DBI 1.2.2 2024-02-16 [1] RSPM
#> dbplyr * 2.4.0 2023-10-26 [1] RSPM
#> DelayedArray 0.29.7 2024-02-26 [1] Bioconduc~
#> DelayedMatrixStats 1.25.1 2023-11-09 [1] Bioconductor
#> desc 1.4.3 2023-12-10 [1] RSPM
#> digest 0.6.34 2024-01-11 [1] RSPM
#> dplyr 1.1.4 2023-11-17 [1] RSPM
#> evaluate 0.23 2023-11-01 [1] RSPM
#> ExperimentHub * 2.11.1 2023-12-11 [1] Bioconduc~
#> fansi 1.0.6 2023-12-08 [1] RSPM
#> fastmap 1.1.1 2023-02-24 [1] RSPM
#> filelock 1.0.3 2023-12-11 [1] RSPM
#> fs 1.6.3 2023-07-20 [1] RSPM
#> generics 0.1.3 2022-07-05 [1] RSPM
#> GenomeInfoDb * 1.39.6 2024-02-08 [1] Bioconduc~
#> GenomeInfoDbData 1.2.11 2024-02-28 [1] Bioconductor
#> GenomicRanges * 1.55.3 2024-02-13 [1] Bioconduc~
#> glue 1.7.0 2024-01-09 [1] RSPM
#> htmltools 0.5.7 2023-11-03 [1] RSPM
#> httr 1.4.7 2023-08-15 [1] RSPM
#> IRanges * 2.37.1 2024-01-19 [1] Bioconduc~
#> jquerylib 0.1.4 2021-04-26 [1] RSPM
#> jsonlite 1.8.8 2023-12-04 [1] RSPM
#> KEGGREST 1.43.0 2023-10-24 [1] Bioconductor
#> knitr 1.45 2023-10-30 [1] RSPM
#> lattice 0.22-5 2023-10-24 [3] CRAN (R 4.4.0)
#> lifecycle 1.0.4 2023-11-07 [1] RSPM
#> magick 2.8.3 2024-02-18 [1] RSPM
#> magrittr 2.0.3 2022-03-30 [1] RSPM
#> Matrix 1.6-5 2024-01-11 [3] CRAN (R 4.4.0)
#> MatrixGenerics * 1.15.0 2023-10-24 [1] Bioconductor
#> matrixStats * 1.2.0 2023-12-11 [1] RSPM
#> memoise 2.0.1 2021-11-26 [1] RSPM
#> mgcv * 1.9-1 2023-12-21 [3] CRAN (R 4.4.0)
#> mime 0.12 2021-09-28 [1] RSPM
#> nlme * 3.1-164 2023-11-27 [3] CRAN (R 4.4.0)
#> pillar 1.9.0 2023-03-22 [1] RSPM
#> pkgconfig 2.0.3 2019-09-22 [1] RSPM
#> pkgdown 2.0.7 2022-12-14 [1] any (@2.0.7)
#> png 0.1-8 2022-11-29 [1] RSPM
#> purrr 1.0.2 2023-08-10 [1] RSPM
#> R6 2.5.1 2021-08-19 [1] RSPM
#> ragg 1.2.7 2023-12-11 [1] RSPM
#> rappdirs 0.3.3 2021-01-31 [1] RSPM
#> Rcpp 1.0.12 2024-01-09 [1] RSPM
#> RCurl 1.98-1.14 2024-01-09 [1] RSPM
#> rjson 0.2.21 2022-01-09 [1] RSPM
#> rlang 1.1.3 2024-01-10 [1] RSPM
#> rmarkdown 2.25 2023-09-18 [1] RSPM
#> RSQLite 2.3.5 2024-01-21 [1] RSPM
#> S4Arrays 1.3.4 2024-02-26 [1] Bioconduc~
#> S4Vectors * 0.41.3 2024-01-01 [1] Bioconduc~
#> sass 0.4.8 2023-12-06 [1] RSPM
#> scuttle * 1.13.0 2023-10-24 [1] Bioconductor
#> sessioninfo 1.2.2 2021-12-06 [1] any (@1.2.2)
#> SingleCellExperiment * 1.25.0 2023-10-24 [1] Bioconductor
#> SparseArray 1.3.4 2024-02-11 [1] Bioconduc~
#> sparseMatrixStats 1.15.0 2023-10-24 [1] Bioconductor
#> SpatialExperiment * 1.13.0 2023-10-24 [1] Bioconductor
#> STexampleData * 1.11.0 2023-10-31 [1] Bioconductor
#> stringi 1.8.3 2023-12-11 [1] RSPM
#> stringr 1.5.1 2023-11-14 [1] RSPM
#> SummarizedExperiment * 1.33.3 2024-01-23 [1] Bioconduc~
#> systemfonts 1.0.5 2023-10-09 [1] RSPM
#> textshaping 0.3.7 2023-10-09 [1] RSPM
#> tibble 3.2.1 2023-03-20 [1] RSPM
#> tidyselect 1.2.0 2022-10-10 [1] RSPM
#> tpSVG * 0.99.7 2024-02-28 [1] local
#> utf8 1.2.4 2023-10-22 [1] RSPM
#> vctrs 0.6.5 2023-12-01 [1] RSPM
#> withr 3.0.0 2024-01-16 [1] RSPM
#> xfun 0.42 2024-02-08 [1] RSPM
#> XVector 0.43.1 2024-01-10 [1] Bioconduc~
#> yaml 2.3.8 2023-12-11 [1] RSPM
#> zlibbioc 1.49.0 2023-10-24 [1] Bioconductor
#>
#> [1] /home/runner/work/_temp/Library
#> [2] /opt/R/devel/lib/R/site-library
#> [3] /opt/R/devel/lib/R/library
#>
#> ──────────────────────────────────────────────────────────────────────────────