Reputation: 77106
I'm finding some suboptimal steps in my current data analysis workflow when it comes to switching between long and wide formats. Consider the three traces shown below, with common x
values,
My data is in long format, useful for plotting and all sorts of pipy things, but for some parts of the analysis it seems much easier to deal with a wide (matrix-like) format. For instance, in this dummy example, I might want to set the baseline to 0 for all traces, by subtracting the average of each trace between 0 and 0.25 (shaded grey area).
I cannot find an easy way to do this sort of thing in long format.
My current strategy is to switch back to wide format, but i) I never remember the correct syntax for dcast/reshape, ii) it's quite inefficient to keep going back and forth between the two.
dwide <- reshape2::dcast(dlong, x~..., value.var="y")
dwide[,-1] <- sweep(dwide[,-1], 2, colMeans(dwide[dwide$x < 0.25, -1]), FUN="-")
dlong2 <- melt(dwide, id="x")
Have I missed some tools that could help? I'm open to data.table suggestions.
Full reproducible example:
library(ggplot2)
library(plyr)
library(reshape2)
## dummy data as noisy lorentzian-shaped peaks with random offset
set.seed(1234)
fake_data <- function(a, x = seq(0, 1, length=100)){
data.frame(x = x,
y = jitter(1e-3 / ((x - a)^2 + 1e-3) + runif(1,0,1),
amount = 0.1))
}
## apply function to all combinations of parameters (one here)
dlong <- plyr::mdply(data.frame(a = c(0.4,0.5,0.6)), fake_data)
ggplot(dlong, aes(x, y, colour=factor(a))) + geom_line() +
annotate("rect", xmin=-Inf, xmax=0.25, ymin=-Inf, ymax=Inf, fill="grey", alpha = 0.3) +
theme_minimal()
dwide <- reshape2::dcast(dlong, x~..., value.var="y")
str(dwide)
dwide[,-1] <- sweep(dwide[,-1], 2, colMeans(dwide[dwide$x < 0.25, -1]), FUN="-")
dlong2 <- melt(dwide, id="x")
ggplot(dlong2, aes(x, value, colour=variable)) + geom_line() +
theme_minimal()
Upvotes: 1
Views: 592
Reputation: 31452
Maybe your minimal example was too trivial to capture all the cases where you might want to go long to wide to long. But for your example at least, I would typically use data.table for this kind of operation:
setDT(dlong)[, y2 := y - mean(y[x < 0.25]), by=a]
ggplot(dlong, aes(x, y2, colour=factor(a))) +
geom_line() +
theme_minimal()
Breaking this down:
by = a
groups the data so that the operation in the 2nd argument of [
.data.table is applied to subsets of dlong corresponding to each value of a
y2 := y - mean(y[x < 0.25])
is therefore calculated for each value of a, separately
:=
is a special operator in data.table that provides assignment by reference rather than assignment by copying (its very efficient)
the 1st argument of [
.datat.table was left blank here, because we wish to operate on all rows of the original dlong data.
Pretty much the same thing can be done with dplyr
by
dlong %>%
group_by(a) %>%
mutate(y2 = y - mean(y[x < 0.25]))
Upvotes: 6