baptiste
baptiste

Reputation: 77106

long vs wide, tidy vs efficient

I'm finding some suboptimal steps in my current data analysis workflow when it comes to switching between long and wide formats. Consider the three traces shown below, with common x values,

enter image description here

My data is in long format, useful for plotting and all sorts of pipy things, but for some parts of the analysis it seems much easier to deal with a wide (matrix-like) format. For instance, in this dummy example, I might want to set the baseline to 0 for all traces, by subtracting the average of each trace between 0 and 0.25 (shaded grey area).

I cannot find an easy way to do this sort of thing in long format.

My current strategy is to switch back to wide format, but i) I never remember the correct syntax for dcast/reshape, ii) it's quite inefficient to keep going back and forth between the two.

dwide <- reshape2::dcast(dlong, x~..., value.var="y")
dwide[,-1] <- sweep(dwide[,-1], 2, colMeans(dwide[dwide$x < 0.25, -1]), FUN="-")
dlong2 <- melt(dwide, id="x")

enter image description here

Have I missed some tools that could help? I'm open to data.table suggestions.


Full reproducible example:

library(ggplot2)
library(plyr)
library(reshape2)

## dummy data as noisy lorentzian-shaped peaks with random offset

set.seed(1234)
fake_data <- function(a, x = seq(0, 1, length=100)){ 
  data.frame(x = x, 
             y = jitter(1e-3 / ((x - a)^2 + 1e-3) + runif(1,0,1), 
                   amount = 0.1))
}

## apply function to all combinations of parameters (one here)
dlong <- plyr::mdply(data.frame(a = c(0.4,0.5,0.6)), fake_data)

ggplot(dlong, aes(x, y, colour=factor(a))) + geom_line() +
  annotate("rect", xmin=-Inf, xmax=0.25, ymin=-Inf, ymax=Inf, fill="grey", alpha = 0.3) +
  theme_minimal()

dwide <- reshape2::dcast(dlong, x~..., value.var="y")
str(dwide)

dwide[,-1] <- sweep(dwide[,-1], 2, colMeans(dwide[dwide$x < 0.25, -1]), FUN="-")
dlong2 <- melt(dwide, id="x")

ggplot(dlong2, aes(x, value, colour=variable)) + geom_line()  +
  theme_minimal()

Upvotes: 1

Views: 592

Answers (1)

dww
dww

Reputation: 31452

Maybe your minimal example was too trivial to capture all the cases where you might want to go long to wide to long. But for your example at least, I would typically use data.table for this kind of operation:

setDT(dlong)[, y2 := y - mean(y[x < 0.25]), by=a]

ggplot(dlong, aes(x, y2, colour=factor(a))) + 
  geom_line() +
  theme_minimal()

enter image description here

Breaking this down:

  • by = a groups the data so that the operation in the 2nd argument of [.data.table is applied to subsets of dlong corresponding to each value of a

  • y2 := y - mean(y[x < 0.25]) is therefore calculated for each value of a, separately

  • := is a special operator in data.table that provides assignment by reference rather than assignment by copying (its very efficient)

  • the 1st argument of [.datat.table was left blank here, because we wish to operate on all rows of the original dlong data.

Pretty much the same thing can be done with dplyr by

dlong %>% 
  group_by(a) %>% 
  mutate(y2 = y - mean(y[x < 0.25]))

Upvotes: 6

Related Questions