Joint definition of cell types from multiple scRNA-seq datasets
Joshua Sodicoff and Joshua Welch
2023-02-27
Source:vignettes/articles/Integrating_multi_scRNA_data.rmd
Integrating_multi_scRNA_data.rmd
This guide will demonstrate the usage of the Liger package in the style of the R Console, which can be accessed through an R development environment (e.g., RStudio) or directly from the R command line.
Stage I: Preprocessing and Normalization
1. Loading data
For the first portion of this protocol, we will be integrating data from control and interferon-stimulated PBMCs from Kang et al, 2017. The data can be found in the Gene Expression Omnibus, Series GSE96583. This dataset was originally in the form of output from the 10X Cellranger pipeline, though we will directly load downsampled versions of the control and stimulated DGEs here.
For convenience, we have prepared the pre-processed data
which are ready to use. There are two datasets: PBMC control
and PBMC interferon-stimulated. We provided ready-to-use liger object,
which can be easily loaded with importPBMC()
.
library(rliger)
pbmcLiger <- importPBMC()
For creating a liger object from raw counts data or any other types of source, please refer to the detailed tutorial for importing data.
2. Preprocess
Before we can run iNMF on our datasets, we must run several preprocessing steps to normalize expression data to account for differences in sequencing depth and efficiency between cells, identify variably expressed genes, and scale the data so that each gene has the same variance. Note that because nonnegative matrix factorization requires positive values, we do not center the data by subtracting the mean. We also do not log transform the data.
pbmcLiger <- pbmcLiger %>%
normalize() %>%
selectGenes() %>%
scaleNotCenter()
Stage II: Integration with Joint Matrix Factorization
3. Determine parameters and perform iNMF integration
We are now able to run integrative non-negative matrix factorization
(iNMF) on the normalized and scaled datasets. The key parameter for this
analysis is k
, the number of matrix factors (analogous to
the number of principal components in PCA). In general, we find that a
value of k
between 20 and 40 is suitable for most analyses
and that results are robust for choice of k
. Because LIGER
is an unsupervised, exploratory approach, there is no single “right”
value for k
. In practice, users choose k
from
a combination of biological prior knowledge and other information. For
this tutorial, we set k = 20
.
pbmcLiger <- runIntegration(pbmcLiger, k = 20)
Starting from rliger 2.0.0, we use an optimized implementation of iNMF. Here we deprecated the parameter
thresh
which stands for a convergence detecter in order to speed up each algorithm iteration by omitting the calculation of objective error.
The factorization yields several lower dimension matrices, including the \(H\) matrices of metagene loadings for each cell, the \(W\) matrix of shared factor loadings and the \(V\) matrices of dataset-specific factor loadings. Please refer to liger object documentation for how to access them.
The time consumption of this step is dependent of the size of the
datasets, in terms of number of cells, number of variable genes
selected, and the value of k
. The implementation supports
OpenMP multi-threading, and there for using a machine with a number of
cores allocated helps speeding it up.
Stage III: Quantile Normalization and Joint Clustering
4. Align the factors
We can now use the resulting factors to jointly cluster cells and
perform quantile normalization by dataset, factor, and cluster to fully
integrate the datasets. All of this functionality is encapsulated within
the quantileNorm()
function, which uses max factor
assignment followed by refinement using a k-nearest neighbors graph.
pbmcLiger <- quantileNorm(pbmcLiger)
5. Clustering
The quantileNorm()
procedure produces joint clustering
assignments and a low-dimensional representation that integrates the
datasets together. These joint clusters directly from iNMF can be used
for downstream analyses (see below). Alternatively, you can also run
Louvain community detection, an algorithm commonly used for single-cell
data, on the normalized cell factors. The Louvain algorithm excels at
merging small clusters into broad cell classes and thus may be more
desirable in some cases than the maximum factor assignments produced
directly by iNMF.
pbmcLiger <- runCluster(pbmcLiger, resolution = 0.25, nNeighbors = 30)
Starting from rliger 2.0.0, cluster labeling will be stored in cell metadata, which can be accessed with
cellMeta(pbmcLiger)
. Use argumentclusterName
to specify unique variable names for the result can enable storing multiple cluster labeling variables at the same time.
Stage IV: Visualization and Downstream Analysis
6. Generate dimensionality reduced embedding
To visualize the clustering of cells graphically, we can project the normalized cell factors to two or three dimensions. LIGER supports both UMAP and t-SNE for this purpose.
pbmcLiger <- runUMAP(pbmcLiger, n_neighbors = 30, min_dist = 0.3)
Starting from rliger 2.0.0, the slot for storing dimensionality reduction matrices will be renamed to “dimReds”. It will be a list that can hold multiple low dimensional matrices that match to the dataset by cell identifiers. Users can access individual matrix with
dimRed(object, "name")
. Use argumentdimredName
to specify unique names for the UMAP result so that it allows storing multiple low-dimensional representation matrices at the same time.
7. Create plots
We provide a variety of utilities for visualization and analysis of clustering, gene expression across datasets, and comparisons of cluster assignments. Here we demonstrate several commonly used examples.
plotByDatasetAndCluster()
returns two graphs, generated
by t-SNE or UMAP in the previous step. The first colors cells by dataset
of origin, and the second by cluster as determined by previous
clustering step. The plots provide visual confirmation that the datasets
are well aligned and the clusters are consistent with the shape of the
data as revealed by UMAP.
The two subplots can individually be generated with
plotDatasetDimRed()
and plotClusterDimRed()
,
respectively.
plotByDatasetAndCluster(pbmcLiger)
To directly study the impact of factors on the clustering and
determine what genes load most highly on each factor, we use the
plotGeneLoadings()
function, which returns plots of factor
loading on the dimensionality reduction and highly loaded genes by
dataset for each factor.
factorMarkers <- getFactorMarkers(pbmcLiger, dataset1 = "ctrl", dataset2 = "stim")
plotGeneLoadings(pbmcLiger, markerTable = factorMarkers, useFactor = 11)
8. Differential expression
Using the runMarkerDEG()
function, we can next identify
gene markers for all clusters. We can also compare expression within
each cluster across datasets, which in this case reveals markers of
interferon-beta stimulation. The function returns a table of data that
allows us to determine the significance of each gene’s differential
expression, including log fold change, area under the curve (auc) and
p-value.
The default parameters performs Wilcoxon rank-sum test at a cluster level:
cluster.results <- runMarkerDEG(pbmcLiger)
head(cluster.results)
## feature group avgExpr logFC statistic auc pval
## 1 LINC00115 0 0.06557615 -0.006707732 19046596 0.4998280 7.902339e-01
## 2 NOC2L 0 0.42228382 -0.916821744 17926394 0.4704313 1.876299e-33
## 3 KLHL17 0 0.01232972 -0.009023670 19042684 0.4997253 4.105767e-01
## 4 PLEKHN1 0 0.18811948 0.141802251 19232439 0.5047050 3.185813e-11
## 5 HES4 0 4.78053287 3.500041958 23275316 0.6107997 4.655312e-241
## 6 ISG15 0 13.53789841 2.106235323 25048607 0.6573351 7.640631e-185
## padj pct_in pct_out
## 1 8.228727e-01 0.43243243 0.4660647
## 2 1.603211e-32 2.81081081 8.5542286
## 3 4.708915e-01 0.08108108 0.1359355
## 4 1.232679e-10 1.24324324 0.3010001
## 5 1.816141e-239 30.48648649 8.1270026
## 6 2.512153e-183 71.37837838 64.4819885
Alternatively, it is also helpful to identify dataset specific markers within each cluster. For example in this tutorial, we can split data by cluster labeling, and within a cluster, find the markers from interferon-stimulated cells.
datasets.results <- runMarkerDEG(pbmcLiger, conditionBy = "dataset", splitBy = "leiden_cluster")
head(datasets.results$`0`) # Note that the first cluster is "0"
## feature group avgExpr logFC statistic auc pval
## 1 LINC00115 ctrl 0.07183401 0.01158862 1701494.0 0.5003517031 7.453285e-01
## 2 NOC2L ctrl 0.66393901 0.44750960 1750817.0 0.5148559253 5.085806e-08
## 3 KLHL17 ctrl 0.02680373 0.02680373 1703295.0 0.5008813161 6.054223e-02
## 4 PLEKHN1 ctrl 0.09765942 -0.16751863 1681486.5 0.4944681756 2.474522e-03
## 5 HES4 ctrl 0.18904920 -8.50274754 775284.5 0.2279848885 3.942972e-269
## 6 ISG15 ctrl 6.05199165 -13.86279031 2251.0 0.0006619428 0.000000e+00
## padj pct_in pct_out
## 1 8.278741e-01 0.4700353 0.4004004
## 2 3.231322e-07 4.4065805 1.4514515
## 3 1.220948e-01 0.1762632 0.0000000
## 4 7.671268e-03 0.6462985 1.7517518
## 5 7.147217e-267 1.2338425 55.4054054
## 6 0.000000e+00 37.7790834 100.0000000
The number of significant genes identified by
runMarkerDEG()
varies and depends on the datasets used. And
the raw output of the function contains the statistics of all tested
genes in all groups (clusters). In order to pick out the top markers for
each cluster, we strongly suggest using package “dplyr”, which provides
a user-friendly interface for data table manipulation. The following
code chunk first filters the markers which are statistically and
biologically significant. For example, we filter the output by taking
markers which have padj (Benjamini-Hochberg adjusted p-value) less than
0.05 and logFC (log fold change between observations in group versus
out) larger than 3. Then for each cluster, we sort the markers primarily
by its padj value in ascending order. Given that mathematically, the
lowest padj values are rounded to 0 as they are too small, for genes
tying on this metric, we then sort the markers by logFC in descending
order. Finally, we select the top 20 markers for each cluster.
library(dplyr)
cluster.results.sort <- cluster.results %>%
filter(padj < 0.05, logFC > 3) %>%
group_by(group) %>%
arrange(padj, -logFC, .by_group = TRUE) %>%
top_n(20)# rank by logFC from high to low
# Show the markers for cluster 3
cluster.results.sort %>% filter(group == 3)
## # A tibble: 20 × 10
## # Groups: group [1]
## feature group avgExpr logFC statistic auc pval padj pct_in
## <chr> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 HLA-DRA 3 17.3 8.42 14081152 0.800 0 0 97.9
## 2 CD74 3 18.0 7.01 16134037 0.917 0 0 99.1
## 3 HLA-DPA1 3 14.5 7.76 13686101 0.778 3.53e-302 2.50e-299 87.2
## 4 RPS5 3 17.0 3.20 13721636 0.780 7.87e-260 4.19e-257 98.2
## 5 HLA-DPB1 3 13.6 6.73 13249646. 0.753 2.68e-251 1.27e-248 81.3
## 6 HLA-DRB1 3 15.3 7.22 13192870. 0.750 3.92e-231 1.47e-228 90.1
## 7 CCR7 3 14.0 7.14 12975000. 0.737 5.12e-223 1.72e-220 83.0
## 8 RPL10A 3 16.2 3.85 12668400 0.720 6.70e-163 1.40e-160 95.0
## 9 EEF1B2 3 13.1 4.34 12392988. 0.704 1.99e-150 3.57e-148 80.0
## 10 RPLP0 3 15.5 3.55 12437076. 0.707 1.40e-144 2.23e-142 92.4
## 11 HSP90AB1 3 12.4 3.63 12280914. 0.698 5.75e-142 8.63e-140 74.6
## 12 RPL5 3 14.5 4.01 12272426. 0.698 8.83e-135 1.24e-132 87.2
## 13 RPSA 3 14.8 3.96 12242276. 0.696 1.30e-131 1.71e-129 89.0
## 14 CXCR4 3 12.9 3.83 12106296. 0.688 1.21e-127 1.47e-125 76.2
## 15 RPL7A 3 14.9 3.03 12167762. 0.692 2.55e-124 2.93e-122 89.7
## 16 RAN 3 10.9 3.41 11805844. 0.671 1.96e-112 1.94e-110 66.5
## 17 RPL4 3 13.8 3.24 11883184. 0.675 1.05e-106 8.96e-105 83.2
## 18 NPM1 3 12.2 3.74 11778232. 0.669 7.79e-106 6.58e-104 73.6
## 19 CREM 3 10.1 3.38 11163252. 0.634 5.25e- 74 3.00e- 72 61.7
## 20 GLTSCR2 3 10.2 3.26 11106918. 0.631 3.24e- 69 1.71e- 67 63.4
## # ℹ 1 more variable: pct_out <dbl>
We can then visualize the expression profiles of individual genes,
such as the differentially expressed genes that we just identified. This
allows us to visually confirm the cluster- or dataset-specific
expression patterns of marker genes. plotGeneDimRed()
returns graphs of gene loading on the dimensionality reduced graph for
each dataset.
plotGeneDimRed(pbmcLiger, "PRF1")
We can also plot the gene expression by dataset.
prf1List <- plotGeneDimRed(pbmcLiger, "PRF1", splitBy = "dataset")
cowplot::plot_grid(plotlist = prf1List, labels = names(prf1List))
We can also use plotGeneDimRed()
to compare the loading
of cluster markers within and between datasets.
IFIT3 <- plotGeneDimRed(pbmcLiger, "IFIT3", splitBy = "dataset")
IFITM3 <- plotGeneDimRed(pbmcLiger, "IFITM3", splitBy = "dataset")
cowplot::plot_grid(IFIT3[[1]], IFIT3[[2]], IFITM3[[1]], IFITM3[[2]], ncol = 2, labels = c("ctrl", "stim", "ctrl", "stim"))
R Session Info
## R version 4.3.2 RC (2023-10-30 r85440)
## Platform: aarch64-apple-darwin20 (64-bit)
## Running under: macOS Sonoma 14.4.1
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/Detroit
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] dplyr_1.1.4 rliger_2.0.1
##
## loaded via a namespace (and not attached):
## [1] sass_0.4.9 utf8_1.2.4 generics_0.1.3
## [4] lattice_0.22-6 digest_0.6.35 magrittr_2.0.3
## [7] evaluate_0.23 grid_4.3.2 fastmap_1.1.1
## [10] jsonlite_1.8.8 Matrix_1.6-5 ggrepel_0.9.5
## [13] scattermore_1.2 purrr_1.0.2 fansi_1.0.6
## [16] viridisLite_0.4.2 scales_1.3.0 codetools_0.2-20
## [19] textshaping_0.3.7 jquerylib_0.1.4 cli_3.6.2
## [22] rlang_1.1.3 RcppAnnoy_0.0.22 uwot_0.1.16
## [25] cowplot_1.1.3 munsell_0.5.1 withr_3.0.0
## [28] RANN_2.6.1 cachem_1.0.8 yaml_2.3.8
## [31] tools_4.3.2 parallel_4.3.2 memoise_2.0.1
## [34] colorspace_2.1-0 ggplot2_3.5.0 sccore_1.0.5
## [37] BiocGenerics_0.48.1 vctrs_0.6.5 R6_2.5.1
## [40] stats4_4.3.2 lifecycle_1.0.4 S4Vectors_0.40.2
## [43] fs_1.6.3 irlba_2.3.5.1 ragg_1.3.0
## [46] pkgconfig_2.0.3 leidenAlg_1.1.3 desc_1.4.3
## [49] pkgdown_2.0.7 pillar_1.9.0 bslib_0.7.0
## [52] gtable_0.3.4 glue_1.7.0 Rcpp_1.0.12
## [55] systemfonts_1.0.6 highr_0.10 xfun_0.43
## [58] tibble_3.2.1 tidyselect_1.2.1 rstudioapi_0.16.0
## [61] knitr_1.45 farver_2.1.1 htmltools_0.5.8
## [64] igraph_2.0.3 labeling_0.4.3 rmarkdown_2.26
## [67] RcppPlanc_1.0.0 compiler_4.3.2