biographr: June 2016

I often have to compare ordered lists of genes, for example, for each of two different experiments, I might have estimates of the probability that a given gene is upregulated in that dataset: the values running from 0 to 1 for genes that are likely to be down- or up-regulated.

Let's set up a visual tool for comparing gene lists that are ordered in this way. Suppose we were comparing just a single pair of experiments on 10000 genes (eg, comparing ordered lists of human genes from one experiment with their mouse paralogs in another). We could set up two horizontal panels, each containing 10000 rectangles with the i'th rectangle in experiment A corresponding to the i'th ordered gene in that experiment. If the genes have the same order in the two experiments you've ballsed something up quite royally and will be mocked forevermore.

In gene-set-enrichment analysis, an analogous horizontal panel of ordered genes is given (though only a single panel is presented). This panel sits underneath a line graph that indicates if functionally-related genes cluster together in regions of 'ordered-gene-space' (so that you can identify if a set of genes is up- or down-regulated en masse by looking at either end of the line graph).

I want to set up a figure that looks as follows:

- | | + -- Indication of which genes are upregd
(or downregd) in a given expt
||||||---------|||||| -- Ordered genes from experiment 1
| |\ X //| -- Links between corresponding genes
||||||---------|||||| -- Ordered genes from experiment 2

As always, I want to do this in ggplot2.

A) First we set up some sample data.
B) Then we work out how to plot the heatmap-type horizontal panels that correspond to the ordered genes
C) Then we work out how to connect the rectangles for a given gene from two (or more) horizontal panels using line segments (and extend to multiple genes)
D) Then we plot the linking line-segments over the top of the horizontal panels of genes
E) Then we realise it doesn't show what we wanted and think up something else instead.

--------
## A) Set up some sample data
# Simplest case: All values are distinct with no missing values
# and the same genes are present in each of the meta-analyses
genes <- c(letters, toupper(letters))

# Set up three different datasets:
geneL <- Map(
function(dname){
# Choose some value between 0.9 and 1 to use to decide if a gene is
# significant in THIS dataset
sig.level <- runif(1, 0.9, 1)

# define the signed parity volumes for a set of 52 genes
dframe <- data.frame(gene.id = genes,
spv = runif(n = 52, min = -1, max = 1))

# parity volume
# - our measure of inter-dataset similarity for a gene in a meta-analysis
dframe$pv = abs(dframe$spv)

# rank of the signed parity volumes, low SPV => low rank
dframe$srank = rank(dframe$spv)

# is the gene significant?
dframe$sig <- (dframe$pv > sig.level) * sign(dframe$spv)

# name of the current collection of datasets (to be used in ggplot)
dframe$collectn = factor(dname)

dframe
},
c('metaA', 'metaB', 'metaC')
)

geneDF <- do.call('rbind', geneL)

head(geneDF)
gene.id spv pv srank sig collectn
metaA.1 a -0.2557522 0.2557522 13 0 metaA
metaA.2 b 0.1457067 0.1457067 27 0 metaA
metaA.3 c 0.8164156 0.8164156 49 0 metaA
metaA.4 d -0.5966361 0.5966361 8 0 metaA
metaA.5 e 0.7967794 0.7967794 48 0 metaA

metaA.6 f 0.8893505 0.8893505 51 0 metaA

summary(geneDF)
gene.id spv pv srank
a : 3 Min. :-0.97384 Min. :0.0008819 Min. : 1.00
A : 3 1st Qu.:-0.37879 1st Qu.:0.2023146 1st Qu.:13.75
b : 3 Median :-0.00186 Median :0.4398626 Median :26.50
B : 3 Mean : 0.02130 Mean :0.4484313 Mean :26.50
c : 3 3rd Qu.: 0.45022 3rd Qu.:0.6607802 3rd Qu.:39.25
C : 3 Max. : 0.98537 Max. :0.9853681 Max. :52.00
(Other):138
sig collectn
Min. :-1.00000 metaA:52
1st Qu.: 0.00000 metaB:52
Median : 0.00000 metaC:52
Mean : 0.00641
3rd Qu.: 0.00000

Max. : 1.00000

--------
B) Plot out the horizontal panels using ggplot:
# Attempt to write a ggplot geom_tile based method to simply plot the ranked
# spv values
# height = 0.5 was chosen so there is space between the datasets
#
fig1 <- function(DF = geneDF){
ggplot(
data = DF, aes(x = srank, y = collectn)
) +
geom_tile(aes(fill = sig, height = 0.5)) +
scale_fill_gradient2(low = 'blue', high = 'red', mid = 'white')
}

fig1()

So each horiziontal panel contains a list of 52 genes, and between 0 and 3 of the genes have been (randomly) assigned to be significantly up- or down-regulated in the three analyses.

We want to see whether the significant genes in one analysis correspond to the significant genes in another. To do this we use geom_segment

-------
C) Then we work out how to connect the rectangles for a given gene from two (or more) horizontal panels using line segments (and extend to multiple genes), ... and ...

D) Then we plot the linking line-segments over the top of the horizontal panels of genes

# Function to obtain the midpoints of the gene-specific panels
gene_twoclass_midpoints <- function(
DF,
ref.class,
alt.class,
genes,
class.colname,
gene.colname,
rank.colname,
offset = 0
){
# returns a data frame with entries xmid.ref, ymid.ref, xmid.alt, ymid.alt
# to be used as 'data' in geom_segment (x = x.ref, xend = x.alt, etc...)

# Is the ref.class numerically lower than the alt.class (if so, ref is
# plotted below alt in the panels)
class.lvls <- levels(DF[, class.colname])
lapply(c(ref.class, alt.class),
function(classname){ stopifnot(classname %in% class.lvls) })
if(which(class.lvls == ref.class) < which(class.lvls == alt.class))
offset.multiplier <- 1
else {
offset.multiplier <- -1
}

# Vertical coordinate of the midpoints is given by the numeric value of the
# class
# Horizontal coordinate of the midpoint is given by the rank of the gene
# within the dataset
class.coords <- Map(
function(classname, multiplier){
class.rows <- which(DF[, class.colname] == classname &
DF[, gene.colname] %in% genes)
class.df <- DF[class.rows, ]
ymid <- which(class.lvls == classname)
data.frame(
gene = class.df[, gene.colname],
x = class.df[, rank.colname],
y = ymid + offset * multiplier
)
},
c(ref.class, alt.class),
c(offset.multiplier, -1 * offset.multiplier)
)
# x/y coords for the two datasets are joined based on gene id
segment.coords <- merge(
x = class.coords[[1]],
y = class.coords[[2]],
by = 'gene',
suffixes = c('.ref', '.alt')
)
}

####
fig2 <- function(DF = geneDF, ref = 'metaA', offset = 0){
# non-reference analyses:
alt.classes <- setdiff(levels(DF$collectn), ref)
# For each dataset make a plots of the genes, ranked by signed parity volume,
# with significantly diffexed genes identified in blue (downregd) or
# red(upregd):
p <- ggplot(
data = DF, aes(x = srank, y = collectn)) +
geom_tile(aes(fill = sig, height = 0.5)) +
scale_fill_gradient2(low = 'blue', high = 'red', mid = 'white')
# use geom_segment to plot lines between (x,y) and (xend,yend)
ref.sample.sig.genes <- subset(DF,
collectn == ref & sig != 0
)$gene.id
# obtain the coordinates of the mid-points for plotting the line segments
coords <- do.call('rbind', Map(
function(alt.class){
gene_twoclass_midpoints(
DF,
ref.class = ref,
alt.class = alt.class,
genes = ref.sample.sig.genes,
class.colname = 'collectn',
gene.colname = 'gene.id',
rank.colname = 'srank',
offset = offset
)
},
alt.classes
))

q <- geom_segment(data = coords,
aes(x=x.ref, y=y.ref, xend = x.alt, yend = y.alt))
p + q
}

fig2()

This looks perfectly fine, but I'd rather the links joined top-to-bottom rather than midpoint-to-midpoint. That is, if a class A is plotted below a class B, then the top point of the panel for class A should link to the bottom point of the panel for class B. To do that, I just change the offset to 0.25 (since the panel heights are 0.5).

fig2(offset = 0.25)

biographr

Thursday, 23 June 2016

parallel GSEA-type plots for diffex-metaanalysis