Reputation: 31
I'm struggling to come up with a working solution to what seems like a fairly simple problem. I have a data frame with both data and factors in it, and I'd like to use the factors to decide which data points need to be subtracted from other data points to produce a new data frame of comparative values.
Here's what the data frame is like:
str(means)
Classes ‘grouped_df’, ‘tbl_df’, ‘tbl’ and 'data.frame': 32 obs. of 5 variables:
$ rat : Factor w/ 8 levels "Rat1","Rat2",..: 1 1 1 1 2 2 2 2 3 3 ...
$ gene : Factor w/ 4 levels "gene1","gene2",..: 1 2 3 4 1 2 3 4 1 2 ...
$ gene_category: Factor w/ 2 levels "control","experimental": 2 2 1 1 2 2 1 1 2 2 ...
$ timepoint1 : num 23.4 18.3 42.1 40.1 25.3 ...
$ timepoint2 : num 23.5 18.4 41.5 39.9 22.8 ...
> head(means)
Source: local data frame [6 x 5]
Groups: rat, gene [6]
rat gene gene_category timepoint1 timepoint2
(fctr) (fctr) (fctr) (dbl) (dbl)
1 Rat1 gene1 experimental 23.36667 23.49667
2 Rat1 gene2 experimental 18.26000 18.38000
3 Rat1 gene3 control 42.05500 41.45000
4 Rat1 gene4 control 40.08667 39.89500
5 Rat2 gene1 experimental 25.29333 22.83000
6 Rat2 gene2 experimental 19.72667 19.19333
For each rat (8 rats in total), I'd like to subtract the 'control' gene values (genes 3 and 4) from the 'experimental' gene values (genes 1 and 2). I need to do this iteratively, so each experimental gene value must have each control gene value subtracted from it (within each rat, but not between rats). The above should be done for each timepoint column.
I've been fiddling with a solution using dplyr, I've got the grouping down but I can't figure out how to do the rest:
diffs <- means %>% group_by(rat, gene, gene_category) %>% here_is_where_i_don't_know_what_to_do)
There is a solution here to a similar problem here but I think it will give me every pairwise operation possible, and that's not what I'm looking for. It also only deals with two factors, while I have three I need to consider.
Here's another solution to a similar problem, but again there are some things about it that make it less than ideal. It deals with one factor only and I'm not sure how it could be applied to a dataset with three factors and two data vectors.
I know that this problem is solved when doing something like a pairwise comparison to determine statistical significance (multiple t-tests, ANOVA, MANOVA, etc), but the packages/base stat functions I'm familiar with that do these tests keep this basic operation under the hood. I'd like a simple solution that uses as few loops as possible with either base R or dplyr/plyr/reshape2, etc.
Upvotes: 1
Views: 1907
Reputation: 9570
I think the solution will involve generating the comparisons you want, then passing them to a standard evaluation mutate_
instead of fighting with group_by
and summarize
.
First, here are the data read in (note, added genes 3/4 for rat2):
means <-
read.table(text =
" rat gene gene_category timepoint1 timepoint2
1 Rat1 gene1 experimental 23.36667 23.49667
2 Rat1 gene2 experimental 18.26000 18.38000
3 Rat1 gene3 control 42.05500 41.45000
4 Rat1 gene4 control 40.08667 39.89500
5 Rat2 gene1 experimental 25.29333 22.83000
6 Rat2 gene2 experimental 19.72667 19.19333
7 Rat2 gene3 control 42.05500 41.45000
8 Rat2 gene4 control 40.08667 39.89500")
Next, generate the set of genes within each class:
geneLists <-
means %>%
{split(.$gene, .$`gene_category`)} %>%
lapply(unique) %>%
lapply(as.character) %>%
lapply(function(x){paste0("`", x, "`")})
Note that that backticks "`" are to protect against potentially invalid column names (e.g., things with spaces). This gives:
$control
[1] "`gene3`" "`gene4`"
$experimental
[1] "`gene1`" "`gene2`"
Then, paste together the comparisons you want:
colsToCreate <-
outer(geneLists[["experimental"]]
, geneLists[["control"]]
, paste, sep = " - ") %>%
as.character()
Giving:
[1] "`gene1` - `gene3`" "`gene2` - `gene3`" "`gene1` - `gene4`" "`gene2` - `gene4`"
Then, use tidyr
to spread the data, generating one row per rat. Note, if you want to spread both timepoint1
and timepoint2
, you would likely need to gather
first (put both times in one column), then create an id column with both time and gene, then spread
using that single id column. This would also require changes to the colsToCreate
construction.
After spreading, pass the vector of columns to generate, and you should have what you want:
means %>%
select(rat, gene, timepoint1) %>%
spread(gene, timepoint1) %>%
mutate_(.dots = colsToCreate)
Voila:
rat gene1 gene2 gene3 gene4 gene1 - gene3 gene2 - gene3 gene1 - gene4 gene2 - gene4
1 Rat1 23.36667 18.26000 42.055 40.08667 -18.68833 -23.79500 -16.72000 -21.82667
2 Rat2 25.29333 19.72667 42.055 40.08667 -16.76167 -22.32833 -14.79334 -20.36000
Actually, getting both timepoints is even easier than I had thought it would be:
means %>%
select(-gene_category) %>%
gather("timepoint", "value", starts_with("timepoint")) %>%
spread(gene, value) %>%
mutate_(.dots = colsToCreate)
gives:
rat timepoint gene1 gene2 gene3 gene4 gene1 - gene3 gene2 - gene3 gene1 - gene4 gene2 - gene4
1 Rat1 timepoint1 23.36667 18.26000 42.055 40.08667 -18.68833 -23.79500 -16.72000 -21.82667
2 Rat1 timepoint2 23.49667 18.38000 41.450 39.89500 -17.95333 -23.07000 -16.39833 -21.51500
3 Rat2 timepoint1 25.29333 19.72667 42.055 40.08667 -16.76167 -22.32833 -14.79334 -20.36000
4 Rat2 timepoint2 22.83000 19.19333 41.450 39.89500 -18.62000 -22.25667 -17.06500 -20.70167
Note also that you can name the vector that holds the column calculating formulas, e.g.,:
colsToCreate2 <-
setNames(colsToCreate
, c("nameA", "nameB", "nameC", "nameD"))
means %>%
select(rat, gene, timepoint1) %>%
spread(gene, timepoint1) %>%
mutate_(.dots = colsToCreate2)
gives:
rat gene1 gene2 gene3 gene4 nameA nameB nameC nameD
1 Rat1 23.36667 18.26000 42.055 40.08667 -18.68833 -23.79500 -16.72000 -21.82667
2 Rat2 25.29333 19.72667 42.055 40.08667 -16.76167 -22.32833 -14.79334 -20.36000
I'm not sure why, but this question excites me enough that I wanted to complete the idea. Here, I gather
the comparisons back into long form, then mutate
the timepoint into a number using parse_number
from readr
and separate
out the compared genes into separate columns to allow efficient access and filtering. Do note that the repeated use of each gene removes assumptions of independence so do not do stats on these without a lot of very careful thinking about control.
longForm <-
means %>%
select(-gene_category) %>%
gather("timepoint", "value", starts_with("timepoint")) %>%
spread(gene, value) %>%
mutate_(.dots = colsToCreate) %>%
select_(.dots = paste0("-",unlist(geneLists))) %>%
gather(Comparison, Difference, -rat, -timepoint) %>%
mutate(time = parse_number(timepoint)) %>%
separate(Comparison, c("exp_Gene", "cont_Gene"), " - ")
head(longForm)
gives
rat timepoint exp_Gene cont_Gene Difference time
1 Rat1 timepoint1 gene1 gene3 -18.68833 1
2 Rat1 timepoint2 gene1 gene3 -17.95333 2
3 Rat2 timepoint1 gene1 gene3 -16.76167 1
4 Rat2 timepoint2 gene1 gene3 -18.62000 2
5 Rat1 timepoint1 gene2 gene3 -23.79500 1
6 Rat1 timepoint2 gene2 gene3 -23.07000 2
Then, we can plot the results:
longForm %>%
ggplot(aes(x = time
, y = Difference
, col = rat)) +
geom_line() +
facet_grid(exp_Gene ~ cont_Gene)
Upvotes: 3
Reputation: 49448
Here's a solution using the latest devel version (1.9.7+) of data.table
:
library(data.table)
setDT(means)
# join on rat being same and gene categories not being same, discard unmatched rows
# then extract interesting columns
means[means, on = .(rat, gene_category > gene_category), nomatch = 0,
.(rat, gene.exp = gene, gene.ctrl = i.gene,
timediff1 = timepoint1 - i.timepoint1, timediff2 = timepoint2 - i.timepoint2)]
# rat gene.exp gene.ctrl timediff1 timediff2
#1: Rat1 gene1 gene3 -18.68833 -17.95333
#2: Rat1 gene2 gene3 -23.79500 -23.07000
#3: Rat1 gene1 gene4 -16.72000 -16.39833
#4: Rat1 gene2 gene4 -21.82667 -21.51500
#5: Rat2 gene1 gene3 -16.76167 -18.62000
#6: Rat2 gene2 gene3 -22.32833 -22.25667
#7: Rat2 gene1 gene4 -14.79334 -17.06500
#8: Rat2 gene2 gene4 -20.36000 -20.70167
And if you want to generalize to arbitrary number of "timepoint" columns:
nm = grep("timepoint", names(means), value = T)
means[means, on = .(rat, gene_category > gene_category), nomatch = 0,
c(.(rat = rat, gene.exp = gene, gene.ctrl = i.gene),
setDT(mget(nm)) - mget(paste0('i.', nm)))]
Upvotes: 3