ovon
ovon

Reputation: 31

Iterative subtraction based on factors in a data frame using R

I'm struggling to come up with a working solution to what seems like a fairly simple problem. I have a data frame with both data and factors in it, and I'd like to use the factors to decide which data points need to be subtracted from other data points to produce a new data frame of comparative values.

Here's what the data frame is like:

str(means)
Classes ‘grouped_df’, ‘tbl_df’, ‘tbl’ and 'data.frame': 32 obs. of  5 variables:
 $ rat          : Factor w/ 8 levels "Rat1","Rat2",..: 1 1 1 1 2 2 2 2 3 3 ...
 $ gene         : Factor w/ 4 levels "gene1","gene2",..: 1 2 3 4 1 2 3 4 1 2 ...
 $ gene_category: Factor w/ 2 levels "control","experimental": 2 2 1 1 2 2 1 1 2 2 ...
 $ timepoint1   : num  23.4 18.3 42.1 40.1 25.3 ...
 $ timepoint2   : num  23.5 18.4 41.5 39.9 22.8 ...
> head(means)
Source: local data frame [6 x 5]
Groups: rat, gene [6]

 rat   gene gene_category timepoint1 timepoint2
(fctr) (fctr)        (fctr)      (dbl)      (dbl)
1   Rat1  gene1  experimental   23.36667   23.49667
2   Rat1  gene2  experimental   18.26000   18.38000
3   Rat1  gene3       control   42.05500   41.45000
4   Rat1  gene4       control   40.08667   39.89500
5   Rat2  gene1  experimental   25.29333   22.83000
6   Rat2  gene2  experimental   19.72667   19.19333

For each rat (8 rats in total), I'd like to subtract the 'control' gene values (genes 3 and 4) from the 'experimental' gene values (genes 1 and 2). I need to do this iteratively, so each experimental gene value must have each control gene value subtracted from it (within each rat, but not between rats). The above should be done for each timepoint column.

I've been fiddling with a solution using dplyr, I've got the grouping down but I can't figure out how to do the rest:

diffs <- means %>% group_by(rat, gene, gene_category) %>% here_is_where_i_don't_know_what_to_do)

There is a solution here to a similar problem here but I think it will give me every pairwise operation possible, and that's not what I'm looking for. It also only deals with two factors, while I have three I need to consider.

Here's another solution to a similar problem, but again there are some things about it that make it less than ideal. It deals with one factor only and I'm not sure how it could be applied to a dataset with three factors and two data vectors.

I know that this problem is solved when doing something like a pairwise comparison to determine statistical significance (multiple t-tests, ANOVA, MANOVA, etc), but the packages/base stat functions I'm familiar with that do these tests keep this basic operation under the hood. I'd like a simple solution that uses as few loops as possible with either base R or dplyr/plyr/reshape2, etc.

Upvotes: 1

Views: 1907

Answers (2)

Mark Peterson
Mark Peterson

Reputation: 9570

I think the solution will involve generating the comparisons you want, then passing them to a standard evaluation mutate_ instead of fighting with group_by and summarize.

First, here are the data read in (note, added genes 3/4 for rat2):

means <-
  read.table(text =
" rat   gene gene_category timepoint1 timepoint2
1   Rat1  gene1  experimental   23.36667   23.49667
2   Rat1  gene2  experimental   18.26000   18.38000
3   Rat1  gene3       control   42.05500   41.45000
4   Rat1  gene4       control   40.08667   39.89500
5   Rat2  gene1  experimental   25.29333   22.83000
6   Rat2  gene2  experimental   19.72667   19.19333
7   Rat2  gene3       control   42.05500   41.45000
8   Rat2  gene4       control   40.08667   39.89500")

Next, generate the set of genes within each class:

geneLists <-
  means %>%
  {split(.$gene, .$`gene_category`)} %>%
  lapply(unique) %>%
  lapply(as.character) %>%
  lapply(function(x){paste0("`", x, "`")})

Note that that backticks "`" are to protect against potentially invalid column names (e.g., things with spaces). This gives:

$control
[1] "`gene3`" "`gene4`"

$experimental
[1] "`gene1`" "`gene2`"

Then, paste together the comparisons you want:

colsToCreate <-
  outer(geneLists[["experimental"]]
        , geneLists[["control"]]
        , paste, sep = " - ") %>%
  as.character()

Giving:

[1] "`gene1` - `gene3`" "`gene2` - `gene3`" "`gene1` - `gene4`" "`gene2` - `gene4`"

Then, use tidyr to spread the data, generating one row per rat. Note, if you want to spread both timepoint1 and timepoint2, you would likely need to gather first (put both times in one column), then create an id column with both time and gene, then spread using that single id column. This would also require changes to the colsToCreate construction.

After spreading, pass the vector of columns to generate, and you should have what you want:

means %>%
  select(rat, gene, timepoint1) %>%
  spread(gene, timepoint1) %>%
  mutate_(.dots = colsToCreate)

Voila:

   rat    gene1    gene2  gene3    gene4 gene1 - gene3 gene2 - gene3 gene1 - gene4 gene2 - gene4
1 Rat1 23.36667 18.26000 42.055 40.08667     -18.68833     -23.79500     -16.72000     -21.82667
2 Rat2 25.29333 19.72667 42.055 40.08667     -16.76167     -22.32833     -14.79334     -20.36000

Actually, getting both timepoints is even easier than I had thought it would be:

means %>%
  select(-gene_category) %>%
  gather("timepoint", "value", starts_with("timepoint")) %>%
  spread(gene, value) %>%
  mutate_(.dots = colsToCreate)

gives:

   rat  timepoint    gene1    gene2  gene3    gene4 gene1 - gene3 gene2 - gene3 gene1 - gene4 gene2 - gene4
1 Rat1 timepoint1 23.36667 18.26000 42.055 40.08667     -18.68833     -23.79500     -16.72000     -21.82667
2 Rat1 timepoint2 23.49667 18.38000 41.450 39.89500     -17.95333     -23.07000     -16.39833     -21.51500
3 Rat2 timepoint1 25.29333 19.72667 42.055 40.08667     -16.76167     -22.32833     -14.79334     -20.36000
4 Rat2 timepoint2 22.83000 19.19333 41.450 39.89500     -18.62000     -22.25667     -17.06500     -20.70167

Note also that you can name the vector that holds the column calculating formulas, e.g.,:

colsToCreate2 <-
  setNames(colsToCreate
           , c("nameA", "nameB", "nameC", "nameD"))

means %>%
  select(rat, gene, timepoint1) %>%
  spread(gene, timepoint1) %>%
  mutate_(.dots = colsToCreate2)

gives:

   rat    gene1    gene2  gene3    gene4     nameA     nameB     nameC     nameD
1 Rat1 23.36667 18.26000 42.055 40.08667 -18.68833 -23.79500 -16.72000 -21.82667
2 Rat2 25.29333 19.72667 42.055 40.08667 -16.76167 -22.32833 -14.79334 -20.36000

I'm not sure why, but this question excites me enough that I wanted to complete the idea. Here, I gather the comparisons back into long form, then mutate the timepoint into a number using parse_number from readr and separate out the compared genes into separate columns to allow efficient access and filtering. Do note that the repeated use of each gene removes assumptions of independence so do not do stats on these without a lot of very careful thinking about control.

longForm <-
  means %>%
  select(-gene_category) %>%
  gather("timepoint", "value", starts_with("timepoint")) %>%
  spread(gene, value) %>%
  mutate_(.dots = colsToCreate) %>%
  select_(.dots = paste0("-",unlist(geneLists))) %>%
  gather(Comparison, Difference, -rat, -timepoint) %>%
  mutate(time = parse_number(timepoint)) %>%
  separate(Comparison, c("exp_Gene", "cont_Gene"), " - ")

head(longForm)

gives

   rat  timepoint exp_Gene cont_Gene Difference time
1 Rat1 timepoint1    gene1     gene3  -18.68833    1
2 Rat1 timepoint2    gene1     gene3  -17.95333    2
3 Rat2 timepoint1    gene1     gene3  -16.76167    1
4 Rat2 timepoint2    gene1     gene3  -18.62000    2
5 Rat1 timepoint1    gene2     gene3  -23.79500    1
6 Rat1 timepoint2    gene2     gene3  -23.07000    2

Then, we can plot the results:

longForm %>%
  ggplot(aes(x = time
             , y = Difference
             , col = rat)) +
  geom_line() +
  facet_grid(exp_Gene ~ cont_Gene)

enter image description here

Upvotes: 3

eddi
eddi

Reputation: 49448

Here's a solution using the latest devel version (1.9.7+) of data.table:

library(data.table)
setDT(means)

# join on rat being same and gene categories not being same, discard unmatched rows
# then extract interesting columns
means[means, on = .(rat, gene_category > gene_category), nomatch = 0,
      .(rat, gene.exp = gene, gene.ctrl = i.gene,
        timediff1 = timepoint1 - i.timepoint1, timediff2 = timepoint2 - i.timepoint2)]
#    rat gene.exp gene.ctrl timediff1 timediff2
#1: Rat1    gene1     gene3 -18.68833 -17.95333
#2: Rat1    gene2     gene3 -23.79500 -23.07000
#3: Rat1    gene1     gene4 -16.72000 -16.39833
#4: Rat1    gene2     gene4 -21.82667 -21.51500
#5: Rat2    gene1     gene3 -16.76167 -18.62000
#6: Rat2    gene2     gene3 -22.32833 -22.25667
#7: Rat2    gene1     gene4 -14.79334 -17.06500
#8: Rat2    gene2     gene4 -20.36000 -20.70167

And if you want to generalize to arbitrary number of "timepoint" columns:

nm = grep("timepoint", names(means), value = T)

means[means, on = .(rat, gene_category > gene_category), nomatch = 0,
      c(.(rat = rat, gene.exp = gene, gene.ctrl = i.gene),
        setDT(mget(nm)) - mget(paste0('i.', nm)))]

Upvotes: 3

Related Questions