Comparing samples measured over a single dimension is a straightforward task that relies on standard, well-established methods. By contrast, measuring the distance between samples in high-dimensional space remains a largely unexplored field. Available solutions rely on multivariate normal distributions, an assumption that is both difficult to check and liable to overlook key behaviors in biological samples, where populations of interest often correspond to a small proportion (<1%) of all the points that have been measured.
We have developed hilbertSimilarity to address the problem of sample similarity in mass cytometry, where samples are measured over up to 100 dimensions and each sample contains a slightly different number of points (or cells). Our method first transforms each sample into a probability vector, by dividing each dimension into a fixed number of bins and associating each cell with a specific multidimensional hypercube. The proportion of cells in each hypercube is characteristic of a given sample. To characterize and compare samples we use measures from Information Theory, since their interpretation in terms of similarity and complexity is straightforward.
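The binning step described above can be sketched in base R (a minimal illustration with toy data; the bin count and matrix below are invented, and the package's actual implementation differs):

```r
set.seed(1)

# toy sample: 1000 "cells" measured over 3 dimensions
mat <- matrix(rnorm(3000), ncol = 3)

nbins <- 4  # fixed number of bins per dimension (invented)

# assign each cell to a bin along every dimension
binned <- apply(mat, 2, function(x) cut(x, breaks = nbins, labels = FALSE))

# combine the per-dimension bins into a single hypercube label
cube <- apply(binned, 1, paste, collapse = "-")

# the proportion of cells per hypercube is the sample's probability vector
prob <- table(cube) / nrow(mat)

sum(prob)  # proportions sum to 1
```

Two samples binned with the same cuts can then be compared directly through their probability vectors, regardless of how many cells each contains.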
After determining sample similarity, our method can also be used to determine which cells are different between samples, by comparing the number of cells in each hypercube to a reference treatment. Since every sample contains a different number of cells, we use a bootstrap approach where the same number of cells is sampled from each treatment to allow for a direct comparison.
To demonstrate the power of hilbertSimilarity we applied the method to a subset of the Bodenmiller et al. dataset, comparing the effect of different stimulations and identifying groups of cells that are significantly affected by different treatments.
Compared to other methods, hilbertSimilarity does not rely on expert-driven gating, nor does it require any hypothesis about the number of populations present in the data. This makes it a powerful tool to quickly assess a dataset before using traditional methods, or when populations are not known a priori.
hilbertSimilarity can be installed using the following command:

devtools::install_github('yannabraham/hilbertSimilarity')
Once installed, the package can be loaded using the standard library command:

library(hilbertSimilarity)
We first need data from the bodenmiller package:
library(bodenmiller)
data(refPhenoMat)
data(untreatedPhenoMat)
data(refFuncMat)
data(untreatedFuncMat)
data(refAnnots)
refAnnots$Treatment <- 'reference'
data(untreatedAnnots)
fullAnnots <- rbind(refAnnots[,names(untreatedAnnots)],
                    untreatedAnnots)
fullAnnots$Treatment <- factor(fullAnnots$Treatment)
fullAnnots$Treatment <- relevel(fullAnnots$Treatment,'reference')
refMat <- cbind(refPhenoMat,refFuncMat)
untreatedMat <- cbind(untreatedPhenoMat,
                      untreatedFuncMat)
fullMat <- rbind(refMat,untreatedMat)
# truncate negative values at zero
fullMat <- apply(fullMat,2,function(x) {
  x[x<0] <- 0
  return(x)
})
In this dataset, 51936 cells corresponding to 4 treatments have been measured over 23 channels. Cells have been gated into 14 populations. The following treatments are compared to one another:
Population | reference | BCR/FcR-XL | PMA/Ionomycin | vanadate |
---|---|---|---|---|
cd14-hladr- | 22 | 18 | 25 | 27 |
cd14-hladrhigh | 60 | 19 | 4 | 1 |
cd14-hladrmid | 244 | 149 | 54 | 18 |
cd14-surf- | 1807 | 1845 | 1520 | 1824 |
cd14+hladr- | 13 | 17 | 15 | 91 |
cd14+hladrhigh | 21 | 3 | 5 | 1 |
cd14+hladrmid | 1373 | 839 | 298 | 155 |
cd14+surf- | 81 | 77 | 214 | 166 |
cd4+ | 3989 | 4076 | 4546 | 503 |
cd8+ | 5068 | 5098 | 5122 | 1200 |
dendritic | 58 | 30 | 28 | 23 |
igm- | 232 | 330 | 181 | 90 |
igm+ | 1069 | 863 | 992 | 395 |
nk | 1755 | 1888 | 1572 | 1822 |
The smallest cell population, cd14+hladrhigh, corresponds to 0.058% of the total cells.
We will first compute the similarity between treatments, by generating a 3rd-order Hilbert curve using combined cuts:
## Cut channels: CD20, IgM, CD4, CD33, HLA.DR, CD14, CD7, CD3, CD123, pStat1, pSlp76, pBtk, pPlcg2, pErk, pLat, pS6, pNFkB, pp38, pStat5, pAkt, pSHP2, pZap70, pStat3
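The code below assumes that each cell has already been assigned per-channel bins (`cutFullMat`) and a curve index (`hc`), presumably via the package's `make.cut`, `do.cut` and `do.hilbert` functions. As a conceptual sketch, here is how per-dimension bin coordinates can be folded into a single integer in base R, using a plain row-major index instead of the actual Hilbert mapping (all names and values below are invented):

```r
set.seed(2)

nbins <- 4  # bins per dimension (invented)
ndim  <- 3  # number of dimensions (invented)

# toy per-dimension bin coordinates, one row per cell, values in 1..nbins
binned <- matrix(sample(nbins, 300, replace = TRUE), ncol = ndim)

# fold the coordinates into one integer per cell, base-nbins (row-major);
# the package walks a Hilbert curve instead, which preserves locality:
# consecutive indices correspond to adjacent hypercubes
idx <- as.vector((binned - 1) %*% nbins^(seq_len(ndim) - 1))

range(idx)  # indices fall in 0 .. nbins^ndim - 1
```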
library(entropy)
# pairwise distance between channels: 1 - mutual information / joint entropy
miFullMat <- matrix(NA, nrow=ncol(fullMat), ncol=ncol(fullMat))
for (i in seq(ncol(fullMat)-1)) {
  for (j in seq(i+1, ncol(fullMat))) {
    cur.tbl <- table(cutFullMat[,i], cutFullMat[,j])
    nent <- 1 - mi.empirical(cur.tbl)/entropy.empirical(cur.tbl)
    miFullMat[i,j] <- miFullMat[j,i] <- nent
  }
}
dimnames(miFullMat) <- list(colnames(fullMat), colnames(fullMat))
hcFullMat <- hclust(as.dist(miFullMat))
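The distance used above is one minus the ratio of mutual information to joint entropy, so it is close to 0 for perfectly dependent channels and close to 1 for independent ones. As a sanity check, the same quantity can be computed from first principles in base R (a minimal sketch with made-up contingency tables; the analysis itself relies on the `entropy` package):

```r
# normalized mutual-information distance for a contingency table:
# 1 - I(X;Y) / H(X,Y)
nmi_dist <- function(tbl) {
  p  <- tbl / sum(tbl)                 # joint distribution
  px <- rowSums(p)
  py <- colSums(p)
  nz <- p > 0
  H  <- -sum(p[nz] * log(p[nz]))       # joint entropy
  I  <- sum(p[nz] * log(p[nz] / outer(px, py)[nz]))  # mutual information
  1 - I / H
}

nmi_dist(diag(4))          # perfectly dependent channels: ~ 0
nmi_dist(outer(1:2, 1:2))  # independent channels: ~ 1
```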
library(reshape2)
library(dplyr)
library(ggplot2)
treatment.table <- with(fullAnnots,
                        table(hc,Treatment))
treatment.pc <- apply(treatment.table, 2, function(x) x/sum(x))
av <- andrewsProjection(t(treatment.pc), breaks=40)
treatment.proj <- t(treatment.pc) %*% t(av$freq)
melt(treatment.proj) %>%
  rename(AndrewsIndex=Var2) %>%
  mutate(AndrewsIndex=av$i[AndrewsIndex]) %>%
  ggplot(aes(x=AndrewsIndex, y=value)) +
  geom_line(aes(group=Treatment, color=Treatment)) +
  theme_light(base_size=16) +
  theme(axis.title.x=element_blank(),
        axis.text.x=element_blank(),
        axis.ticks.x=element_blank(),
        axis.title.y=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks.y=element_blank())
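The Andrews projection used above maps each probability vector x = (x1, x2, …) onto the curve f(t) = x1/√2 + x2·sin(t) + x3·cos(t) + x4·sin(2t) + …, so that similar vectors trace similar curves. A minimal base-R version of the classic Andrews curve (the helper name and toy vectors are invented; the package's `andrewsProjection` may differ in details):

```r
# classic Andrews curve:
# f(t) = x1/sqrt(2) + x2*sin(t) + x3*cos(t) + x4*sin(2t) + ...
andrews_curve <- function(x, tt) {
  y <- rep(x[1] / sqrt(2), length(tt))
  for (i in seq_along(x)[-1]) {
    k <- i %/% 2
    y <- y + if (i %% 2 == 0) x[i] * sin(k * tt) else x[i] * cos(k * tt)
  }
  y
}

tt <- seq(-pi, pi, length.out = 40)
curve_a <- andrews_curve(c(0.5, 0.3, 0.2), tt)
curve_b <- andrews_curve(c(0.2, 0.3, 0.5), tt)

# similar vectors trace similar curves; these two vectors differ,
# so their curves separate over most of the range
range(curve_a - curve_b)
```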
For the comparison to be meaningful, we need to limit the analysis to bins that contain a minimum number of cells. This number is arbitrary, and represents the smallest group of cells that should be called a population. We also need to define the number of bootstraps, the significant fold change, and the limit of significance:
# total cells per Hilbert index per treatment
treatment.table <- with(fullAnnots,
                        table(hc,Treatment))
# minimum number of cells
lim <- 40
# number of bootstraps
N <- 400
# significant fold change
flim <- 2
# limit of significance
p <- 0.95
# bins with enough cells
boot.ind <- rownames(treatment.table)[which(rowSums(treatment.table)>=lim)]
# number of cells in each bootstrap
ncells <- floor(min(colSums(treatment.table[boot.ind,]))/1000)*1000
For this exercise we use 40 as the minimum population size. 218 indices out of 7114 contain enough cells.
Because each treatment corresponds to a different number of cells, we will use a bootstrap approach to identify indices that are significantly different between conditions. We will bootstrap 2000 cells from the selected bins, then check which bins contain at least 2 times more cells in the reference compared to other treatments. This operation will be repeated 400 times, and any bin that is consistently different in at least 380 bootstraps (N × p) will be considered significant.
library(abind)
boot.mat <- lapply(seq(1,N), function(x) {
  if (x %% 100 == 0) cat(x,'\n')
  # bootstrap cells from the bins with enough cells
  cur.boot <- sample(which(hc %in% boot.ind),
                     size=ncells,
                     replace=TRUE)
  return(table(factor(hc[cur.boot], levels=boot.ind),
               droplevels(fullAnnots[cur.boot,'Treatment'])))
})
## 100
## 200
## 300
## 400
boot.mat <- abind(boot.mat, along=3)
boot.log <- log2(boot.mat)
# log2 fold change of each treatment against the reference column
boot.fold <- sweep(boot.log[,-1,], c(1,3), boot.log[,1,])
boot.sign <- sign(boot.fold)
# ignore changes below the fold-change threshold
boot.sign[abs(boot.fold)<log2(flim)] <- 0
# count, per bin and treatment, how many bootstraps are down, unchanged or up
boot.res <- apply(boot.sign, c(1,2), function(x) table(sign(x))[c('-1','0','1')])
boot.res <- aperm(boot.res, c(2,1,3))
boot.res[is.na(boot.res)] <- 0
dimnames(boot.res)[[2]] <- c('-1','0','1')
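The sweep/sign bookkeeping above can be checked on a toy array (counts below are invented: 2 bins, a reference plus one treatment, 3 bootstraps):

```r
# toy counts, dims = bins x treatments x bootstraps; column 1 is the reference
toy <- array(c(10, 10, 40, 5,
               10, 10, 42, 4,
               10, 10, 38, 5),
             dim = c(2, 2, 3))

toy.log  <- log2(toy)
# log2 fold change of the treatment column against the reference column
toy.fold <- sweep(toy.log[, -1, , drop = FALSE], c(1, 3), toy.log[, 1, ])
toy.sign <- sign(toy.fold)
# zero out changes below the 2-fold threshold
toy.sign[abs(toy.fold) < log2(2)] <- 0

# bin 1 is at least 2-fold up in every bootstrap, bin 2 at least 2-fold down
apply(toy.sign, 1, sum)  # c(3, -3)
```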
After the analysis, the following number of bins are found significant:
sig.res <- apply(boot.res[,c('-1','1'),]>(N*p), c(1,3), any)
sum.sig.res <- cbind(colSums(sig.res),
                     colSums(treatment.table[boot.ind[rowSums(sig.res)>0],colnames(sig.res)]))
p.cells.sig.res <- lapply(colnames(sig.res),
                          function(cur.cl) {
                            x <- treatment.table[boot.ind,cur.cl]
                            return(sum(x[sig.res[,cur.cl]])/sum(x))
                          })
names(p.cells.sig.res) <- colnames(sig.res)
sum.sig.res <- cbind(sum.sig.res,
                     round(100*unlist(p.cells.sig.res),0))
colnames(sum.sig.res) <- c('Number of indices',
                           'Number of Significant cells',
                           'Percent Significant Cells')
kable(sum.sig.res)
Treatment | Number of indices | Number of Significant cells | Percent Significant Cells |
---|---|---|---|
BCR/FcR-XL | 3 | 5817 | 1 |
PMA/Ionomycin | 46 | 3999 | 45 |
vanadate | 47 | 1774 | 30 |
Focusing on BCR/FcR-XL, which 3 indices were found significant?
Hilbert index | reference | BCR/FcR-XL | PMA/Ionomycin | vanadate |
---|---|---|---|---|
2098176 | 1 | 97 | 32 | 0 |
4194304 | 69 | 3 | 7 | 13 |
6291455 | 198 | 14 | 23 | 48 |
Compared to the unstimulated sample, cells are decreasing in 2 bins and increasing in a 3rd one. Based on manual gating, what do these bins correspond to?
cur.treatment <- 'BCR/FcR-XL'
sig.cells <- with(fullAnnots,
                  table(droplevels(Cells[hc %in% boot.ind[sig.res[,cur.treatment]]])))
total.sig.cells <- with(fullAnnots,
                        table(droplevels(Cells[Cells %in% names(sig.cells) &
                                                 hc %in% boot.ind])))
treatment.sig.cells <- with(fullAnnots,
                            table(droplevels(Cells[hc %in% boot.ind[sig.res[,cur.treatment]] &
                                                     Treatment==cur.treatment])))
treatment.total.cells <- with(fullAnnots,
                              table(droplevels(Cells[Cells %in% names(sig.cells) &
                                                       hc %in% boot.ind &
                                                       Treatment==cur.treatment])))
sum.sig.cells <- cbind(sig.cells,
                       total.sig.cells,
                       round(100*sig.cells/total.sig.cells),
                       treatment.sig.cells,
                       treatment.total.cells,
                       round(100*treatment.sig.cells/treatment.total.cells))
colnames(sum.sig.cells) <- c('Number of Significant Cells',
                             'Total number of Cells',
                             'Percent of Total',
                             paste('Number of Significant Cells in',cur.treatment),
                             paste('Number of Cells in',cur.treatment),
                             paste('Percent of',cur.treatment))
kable(sum.sig.cells)
Population | Number of Significant Cells | Total number of Cells | Percent of Total | Number of Significant Cells in BCR/FcR-XL | Number of Cells in BCR/FcR-XL | Percent of BCR/FcR-XL |
---|---|---|---|---|---|---|
igm- | 30 | 288 | 10 | 23 | 119 | 19 |
igm+ | 475 | 1373 | 35 | 91 | 320 | 28 |
All the cells that are found significant are B cells, and they represent a large fraction of all B cells in the BCR/FcR-XL sample, even when only taking the bootstrapped indices into account.