---
title: "Identifying Treatment effects using hilbertSimilarity"
author: "Yann Abraham"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Identifying Treatment effects using hilbertSimilarity}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r,echo=FALSE}
library(knitr)
opts_chunk$set(warning=FALSE,
               fig.width=8,
               fig.height=8,
               fig.retina=1,
               fig.keep='high',
               fig.align='center')
```

# Introduction

Comparing samples defined over a single dimension is a straightforward task that relies on standard, well-established methods. Measuring the distance between samples in high-dimensional space, on the other hand, remains a largely unexplored field. Available solutions rely on multivariate normal distributions, an assumption that is both difficult to check and liable to overlook key behaviors in biological samples, where populations of interest often correspond to a small proportion (<1%) of all the points that have been measured.

We have developed `hilbertSimilarity` to address the problem of sample similarity in mass cytometry, where samples are measured over up to 100 dimensions and each sample contains a slightly different number of points (or cells). Our method first transforms each sample into a probability vector, by dividing each dimension into a fixed number of bins and associating each cell with a specific multidimensional cube. The proportion of cells in each hypercube is characteristic of a given sample. To characterize and compare samples we use measures from Information Theory, since their interpretation in terms of similarity and complexity is straightforward.

After determining sample similarity, our method can also be used to determine which cells differ between samples, by comparing the number of cells in each hypercube to a reference treatment. Since every sample contains a different number of cells, we use a bootstrap approach where the same number of cells is sampled from each treatment to allow for a direct comparison.

To demonstrate the power of `hilbertSimilarity` we applied the method to a subset of the Bodenmiller *et al.* dataset, comparing the effect of different stimulations and identifying groups of cells that are significantly affected by different treatments. Compared to other methods, `hilbertSimilarity` does not rely on expert-driven gating, nor does it require any hypothesis about the number of populations present in the data. This makes it a powerful tool to quickly assess a dataset before using traditional methods, or when populations are not known *a priori*.

# Installation

`hilbertSimilarity` can be installed using the following command:

```
devtools::install_github('yannabraham/hilbertSimilarity')
```

Once installed, the package can be loaded using the standard `library` command.

```{r}
library(hilbertSimilarity)
```
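Before moving to real data, the core idea can be illustrated with a small self-contained sketch. In the toy example below (the data, the number of bins and all object names are made up for illustration and are not part of the package API), each cell is assigned to a hypercube by combining its per-dimension bin indices, and each sample is then summarized as the proportion of cells per hypercube:

```{r}
set.seed(42)
# two toy samples measured over 3 dimensions
toy <- lapply(1:2, function(i) matrix(rnorm(300*3), ncol=3))
# 4 bins per dimension, defined on the pooled data
breaks <- apply(do.call(rbind, toy), 2,
                function(x) seq(min(x), max(x), length.out=5))
# assign every cell to a hypercube by pasting its per-dimension bin indices
cubes <- lapply(toy, function(m) {
  bins <- sapply(seq_len(ncol(m)), function(j)
    cut(m[,j], breaks=breaks[,j], include.lowest=TRUE, labels=FALSE))
  apply(bins, 1, paste, collapse='-')
})
# each sample becomes a probability vector over the same set of hypercubes
all.cubes <- sort(unique(unlist(cubes)))
toy.prob <- sapply(cubes, function(x) table(factor(x, levels=all.cubes))/length(x))
head(toy.prob)
```

In the rest of this vignette the same idea is applied to real data: `make.cut` and `do.cut` handle the binning, and `do.hilbert` turns each combination of bins into a single index along a Hilbert curve.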
# Loading some test data

We first need data from the `bodenmiller` package:

```{r}
library(bodenmiller)
data(refPhenoMat)
data(untreatedPhenoMat)
data(refFuncMat)
data(untreatedFuncMat)
data(refAnnots)
refAnnots$Treatment <- 'reference'
data(untreatedAnnots)
fullAnnots <- rbind(refAnnots[,names(untreatedAnnots)],
                    untreatedAnnots)
fullAnnots$Treatment <- factor(fullAnnots$Treatment)
fullAnnots$Treatment <- relevel(fullAnnots$Treatment,'reference')
refMat <- cbind(refPhenoMat,refFuncMat)
untreatedMat <- cbind(untreatedPhenoMat,
                      untreatedFuncMat)
fullMat <- rbind(refMat,untreatedMat)
# clip negative values to 0
fullMat <- apply(fullMat,2,function(x) {
  x[x<0] <- 0
  return(x)
})
```

In this dataset `r nrow(fullMat)` cells corresponding to `r nlevels(fullAnnots$Treatment)` treatments have been measured over `r ncol(fullMat)` channels. Cells have been gated into `r nlevels(fullAnnots$Cells)` populations. The following treatments are compared to one another:

```{r,echo=FALSE}
nCellTreats <- with(fullAnnots,table(Cells,Treatment))
pCellTreats <- apply(nCellTreats,2,function(x) x/sum(x))
kable(nCellTreats)
```

The smallest cell population, `r names(which.min(100*table(fullAnnots$Cells)/nrow(fullAnnots)))`, corresponds to `r round(min(100*table(fullAnnots$Cells)/nrow(fullAnnots)),3)`% of the total cells.

# Computing the similarity between treatments

We will first compute the similarity between treatments by generating a Hilbert curve of order 2 from the combined cuts (3 bins per dimension):

* computing the cuts

```{r}
# compute the cuts
nbins <- 3
cuts <- make.cut(fullMat,
                 n=nbins+1,
                 count.lim=40)
```

* cutting the matrix

```{r}
cutFullMat <- do.cut(fullMat,cuts,type='combined')
```

* ordering the dimensions

```{r}
library(entropy)
# distance between channels: 1 minus the normalized mutual information
miFullMat <- matrix(NA,
                    nrow=ncol(fullMat),
                    ncol=ncol(fullMat))
for (i in seq(ncol(fullMat)-1)) {
  for (j in seq(i+1,ncol(fullMat))) {
    cur.tbl <- table(cutFullMat[,i],cutFullMat[,j])
    nent <- 1-mi.empirical(cur.tbl)/entropy.empirical(cur.tbl)
    miFullMat[i,j] <- miFullMat[j,i] <- nent
  }
}
dimnames(miFullMat) <- list(colnames(fullMat),colnames(fullMat))
# cluster the channels so that related dimensions end up next to each other
hcFullMat <- hclust(as.dist(miFullMat))
```

* generating the Hilbert curve

```{r}
hc <- do.hilbert(cutFullMat[,rev(hcFullMat$order)],2)
```

* visualizing the different samples

```{r}
library(reshape2)
library(dplyr)
library(ggplot2)
# proportion of cells per Hilbert index and treatment
treatment.table <- with(fullAnnots,
                        table(hc,Treatment))
treatment.pc <- apply(treatment.table,2,function(x) x/sum(x))
# project each treatment onto Andrews curves
av <- andrewsProjection(t(treatment.pc),breaks=40)
treatment.proj <- t(treatment.pc) %*% t(av$freq)
melt(treatment.proj) %>%
  rename(AndrewsIndex=Var2) %>%
  mutate(AndrewsIndex=av$i[AndrewsIndex]) %>%
  ggplot(aes(x=AndrewsIndex,y=value))+
  geom_line(aes(group=Treatment,color=Treatment))+
  theme_light(base_size=16)+
  theme(axis.title.x=element_blank(),
        axis.text.x=element_blank(),
        axis.ticks.x=element_blank(),
        axis.title.y=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks.y=element_blank())
```
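The Andrews curves give a qualitative picture of how the treatments differ. To attach a number to the similarity between treatments, the per-treatment probability vectors can be compared directly with an information-theoretic measure. The sketch below computes the pairwise Jensen–Shannon divergence between treatments in plain base R; the `jsd` helper is defined here for illustration and is not a function of the package:

```{r}
# Jensen-Shannon divergence between two probability vectors (in bits)
jsd <- function(p, q) {
  m <- (p + q)/2
  kl <- function(a, b) sum(ifelse(a == 0, 0, a*log2(a/b)))
  (kl(p, m) + kl(q, m))/2
}
# pairwise divergence between treatments, based on the Hilbert index proportions
n.treat <- ncol(treatment.pc)
treatment.dist <- outer(seq_len(n.treat), seq_len(n.treat),
                        Vectorize(function(i, j) jsd(treatment.pc[,i], treatment.pc[,j])))
dimnames(treatment.dist) <- list(colnames(treatment.pc), colnames(treatment.pc))
kable(round(treatment.dist, 3))
```

A divergence of 0 would mean that two treatments occupy the Hilbert curve in identical proportions; larger values indicate increasingly different bin occupancies.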
# Identifying treatment effects

For the comparison to be meaningful, we need to limit the analysis to bins that contain a minimum number of cells. This number is arbitrary, and represents the smallest group of cells that should be called a population. We also need to define the number of bootstraps, the significant fold change, and the limit of significance:

```{r}
# total cells per Hilbert index per treatment
treatment.table <- with(fullAnnots,
                        table(hc,Treatment))
# minimum number of cells
lim <- 40
# number of bootstraps
N <- 400
# significant fold change
flim <- 2
# limit of significance
p <- 0.95
# bins with enough cells
boot.ind <- rownames(treatment.table)[which(rowSums(treatment.table)>=lim)]
# number of cells in each bootstrap
ncells <- floor(min(colSums(treatment.table[boot.ind,]))/1000)*1000
```

For this exercise we use `r lim` as the minimum population size. `r length(boot.ind)` indices out of `r nrow(treatment.table)` contain enough cells.

Because each treatment corresponds to a different number of cells, we will use a bootstrap approach to identify indices that are significantly different between conditions. We will bootstrap `r ncells` cells from the selected bins, then check, for each bin, whether it contains at least `r flim` times more or fewer cells in a treatment compared to the reference. This operation will be repeated `r N` times, and any bin that changes in the same direction in more than `r p*N` bootstraps will be considered significant.

```{r}
library(abind)
boot.mat <- lapply(seq(1,N),function(x) {
  # report progress every 100 bootstraps
  if (x%/%100==x/100) cat(x,'\n')
  # bootstrap cells from the selected Hilbert indices
  cur.boot <- sample(which(hc %in% boot.ind),
                     size=ncells,
                     replace=TRUE)
  return(table(factor(hc[cur.boot],levels=boot.ind),
               droplevels(fullAnnots[cur.boot,'Treatment'])))
})
boot.mat <- abind(boot.mat,along=3)
boot.log <- log2(boot.mat)
# log2 fold change of each treatment against the reference
boot.fold <- sweep(boot.log[,-1,],c(1,3),boot.log[,1,])
boot.sign <- sign(boot.fold)
# fold changes below the threshold, or undefined because a bin is empty in
# both the treatment and the reference draw, are treated as no change
boot.sign[is.nan(boot.fold) | abs(boot.fold)<log2(flim)] <- 0
# for every index and treatment, count the bootstraps showing a consistent
# increase or decrease compared to the reference
boot.count <- apply(boot.sign,c(1,2),function(x) max(sum(x>0),sum(x<0)))
# an index is significant if it changes in the same direction
# in more than N*p bootstraps
sig.res <- boot.count>(N*p)
sum.sig.res <- cbind(colSums(sig.res),
                     colSums(treatment.table[boot.ind[rowSums(sig.res)>0],colnames(sig.res)]))
p.cells.sig.res <- lapply(colnames(sig.res),function(cur.cl) {
  x <- treatment.table[boot.ind,cur.cl]
  return(sum(x[sig.res[,cur.cl]])/sum(x))
})
names(p.cells.sig.res) <- colnames(sig.res)
sum.sig.res <- cbind(sum.sig.res,
                     round(100*unlist(p.cells.sig.res),0))
colnames(sum.sig.res) <- c('Number of indices',
                           'Number of Significant cells',
                           'Percent Significant Cells')
kable(sum.sig.res)
```

Focusing on **BCR/FcR-XL**: which `r sum(sig.res[,'BCR/FcR-XL'])` indices were found significant?

```{r}
kable(treatment.table[boot.ind[sig.res[,'BCR/FcR-XL']],])
```

Compared to the reference sample, cells are decreasing in 2 bins and increasing in a 3^rd^ one.
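The direction and size of these changes can be read directly from the bootstrapped fold changes. As a quick check (a minimal sketch reusing the objects computed above, not part of the original analysis), the median log2 fold change versus the reference for the significant **BCR/FcR-XL** indices is:

```{r}
# median bootstrapped log2 fold change (BCR/FcR-XL vs reference) for the
# significant indices; negative values correspond to a loss of cells
# (an infinite value would indicate a bin with no reference cells in most draws)
sig.fold <- boot.fold[sig.res[,'BCR/FcR-XL'],'BCR/FcR-XL',,drop=FALSE]
round(apply(sig.fold, 1, median), 2)
```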
Based on manual gating, what do these bins correspond to?

```{r}
cur.treatment <- 'BCR/FcR-XL'
# gated populations found in the significant indices, over all treatments
sig.cells <- with(fullAnnots,
                  table(droplevels(Cells[hc %in% boot.ind[sig.res[,cur.treatment]]])))
# the same populations, counted over all bootstrapped indices
total.sig.cells <- with(fullAnnots,
                        table(droplevels(Cells[Cells %in% names(sig.cells) &
                                                 hc %in% boot.ind])))
# restricted to the current treatment
treatment.sig.cells <- with(fullAnnots,
                            table(droplevels(Cells[hc %in% boot.ind[sig.res[,cur.treatment]] &
                                                     Treatment==cur.treatment])))
treatment.total.cells <- with(fullAnnots,
                              table(droplevels(Cells[Cells %in% names(sig.cells) &
                                                       hc %in% boot.ind &
                                                       Treatment==cur.treatment])))
sum.sig.cells <- cbind(sig.cells,
                       total.sig.cells,
                       round(100*sig.cells/total.sig.cells),
                       treatment.sig.cells,
                       treatment.total.cells,
                       round(100*treatment.sig.cells/treatment.total.cells))
colnames(sum.sig.cells) <- c('Number of Significant Cells',
                             'Total Number of Cells',
                             'Percent of Total',
                             paste('Number of Significant Cells in',cur.treatment),
                             paste('Number of Cells in',cur.treatment),
                             paste('Percent of',cur.treatment))
kable(sum.sig.cells)
```

All the cells that are found significant are B cells, and they represent a large fraction of all B cells in the `r cur.treatment` sample, even when only the bootstrapped indices are taken into account.
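As a possible follow-up (a sketch reusing the objects defined above, not part of the original analysis), the phenotype of the affected cells can also be checked directly on the measured channels, for example by comparing median intensities inside and outside the significant indices:

```{r}
# median channel intensity for cells in the significant indices versus cells
# in the remaining bootstrapped indices (all treatments pooled)
in.sig <- hc %in% boot.ind[sig.res[,cur.treatment]]
in.boot <- hc %in% boot.ind
marker.med <- rbind(significant=apply(fullMat[in.sig,],2,median),
                    other=apply(fullMat[in.boot & !in.sig,],2,median))
kable(round(t(marker.med),2))
```

Channels with a markedly higher median in the significant indices point to the markers driving the difference, which should be consistent with the B cell assignment obtained from manual gating.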