Reputation: 95
I am dealing with datalogger time-series outputs in which common environmental variables (e.g. light, temperature, wind speed) are measured at various locations and stored together in one dataframe. Each column is named first by the environmental variable being measured (e.g. "a"), then by its physical location (e.g. "1"), separated by "_".
As an example, we can imagine a dataframe with the environmental variables "a", "b", and "c" being measured simultaneously at three different locations. This gives one column for date-time plus a column for each of the nine unique variable-location combinations, like so:
"dt" "a_1" "a_2" "a_3" "b_1" "b_2" "b_3" "c_1" "c_2" "c_3"
I need to convert the data frame into long format, so that there is a single column each for "dt", "a", "b", and "c", with a new column "loc" for the location associated with each environmental variable measurement.
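In other words, the long-format output should have a column layout like this, with one row per date-time per location:
"dt" "loc" "a" "b" "c"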
The code below creates a mock dataframe and then uses a very cumbersome method to create the output I would like. However, this example code is too cumbersome to use for large dataframes (i.e. dozens of variables and locations).
How could I make this more efficient by using the information in the column names to automatically transform the data, preferably with a tidyverse approach via tidyr and dplyr?
### Mock data:
start_time <- as.POSIXct("2000-10-01 10:10:10")
df <- data.frame(
  dt  = seq.POSIXt(from = start_time, length.out = 100, by = 1),
  a_1 = abs(rnorm(100, 1000, 500)),
  b_1 = abs(rnorm(100, 35, 5)),
  c_1 = abs(rnorm(100, 10, 2.5)),
  a_2 = abs(rnorm(100, 1000, 500)),
  b_2 = abs(rnorm(100, 35, 5)),
  c_2 = abs(rnorm(100, 10, 2.5)),
  a_3 = abs(rnorm(100, 1000, 500)),
  b_3 = abs(rnorm(100, 35, 5)),
  c_3 = abs(rnorm(100, 10, 2.5))
)
### New data frames for each location, with location identifier column:
library(dplyr)

loc1 <- df %>%
  select(dt, a_1, b_1, c_1) %>%
  rename(a = a_1) %>%
  rename(b = b_1) %>%
  rename(c = c_1) %>%
  mutate(loc = as.character("1"))
loc2 <- df %>%
  select(dt, a_2, b_2, c_2) %>%
  rename(a = a_2) %>%
  rename(b = b_2) %>%
  rename(c = c_2) %>%
  mutate(loc = as.character("2"))

loc3 <- df %>%
  select(dt, a_3, b_3, c_3) %>%
  rename(a = a_3) %>%
  rename(b = b_3) %>%
  rename(c = c_3) %>%
  mutate(loc = as.character("3"))
### Data in desired long format:
all_data_long <- rbind(loc1, loc2, loc3)
Upvotes: 1
Views: 1409
Reputation: 740
Using the tidyverse approach, as requested, does this work for you?
library(dplyr)
library(tidyr)
out <- df %>%
  gather(Letter, Val, -dt) %>%
  separate(Letter, into = c("Letter", "Loc")) %>%
  spread(Letter, Val)
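If you are on tidyr 1.0.0 or later, the same reshape can also be written in one step with pivot_longer(); a minimal sketch, where the ".value" sentinel keeps the first part of each column name as its own column:
out <- df %>%
  pivot_longer(-dt,
               names_to = c(".value", "loc"),
               names_sep = "_")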
Upvotes: 3
Reputation: 28685
You can use data.table::melt with patterns for the measure.vars.
Note: As mentioned by @Istrel, the variable column here actually indicates which occurrence of a column with that pattern the given rows belong to, not the second part of the column name. For example, if the a_* columns were a_1, a_2, a_99, the variable value corresponding to that last column would still be 3, not 99.
library(data.table)
setDT(df)
all_data_long2 <- melt(df, id.vars = 'dt',
                       measure.vars = patterns(a = 'a_*', b = 'b_*', c = 'c_*'))
# dt variable a b c
# 1: 2000-10-01 10:10:10 1 1181.68131 30.12497 7.733530
# 2: 2000-10-01 10:10:11 1 402.04443 35.97919 11.972216
# 3: 2000-10-01 10:10:12 1 1002.14735 37.94243 10.570481
# 4: 2000-10-01 10:10:13 1 574.04331 30.69238 11.131428
# 5: 2000-10-01 10:10:14 1 221.77960 36.41496 5.349643
# ---
# 296: 2000-10-01 10:11:45 3 900.11802 36.16800 8.150693
# 297: 2000-10-01 10:11:46 3 820.79518 34.56636 10.771145
# 298: 2000-10-01 10:11:47 3 825.68334 29.42049 14.811727
# 299: 2000-10-01 10:11:48 3 17.55973 42.44830 14.625586
# 300: 2000-10-01 10:11:49 3 971.93711 37.43062 11.339470
After some small changes, this is the same as your output:
setnames(all_data_long2, 'variable', 'loc')
all_data_long2$loc <- as.character(all_data_long2$loc)

all.equal(all_data_long,
          all_data_long2[, names(all_data_long), with = FALSE],
          check.attributes = FALSE)
# [1] TRUE
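If the location suffixes are not simply 1, 2, 3 (the caveat noted above), one option is to map them back from the original column names, which patterns() matches in their order of appearance. A rough sketch, assuming every location has an a_* column:
locs <- sub('^a_', '', grep('^a_', names(df), value = TRUE))
all_data_long2[, loc := locs[as.integer(loc)]]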
A benchmark shows a large relative difference in speed, but neither method takes even a second on this test data, so it won't matter unless your real data is much larger.
f.dt <- function(df){
  setDT(df)
  melt(df, id.vars = 'dt',
       measure.vars = patterns(a = 'a_*', b = 'b_*', c = 'c_*'))
}

f.tidy <- function(df){
  df %>%
    gather(Letter, Val, -dt) %>%
    separate(Letter, into = c("Letter", "Loc")) %>%
    spread(Letter, Val)
}
library(microbenchmark)
microbenchmark(f.dt(df), f.tidy(df))
# Unit: microseconds
# expr min lq mean median uq max neval
# f.dt(df) 236.308 296.2055 533.9328 379.488 463.590 10763.90 100
# f.tidy(df) 6816.003 7498.2600 9466.6462 7899.900 9653.132 44043.92 100
Upvotes: 2
Reputation: 2588
You can use the tidyr separate function on the melted data frame to split your column names into parameter and location. The next step is dcast to a wider (still long) format with a separate column for each parameter.
library(reshape2)
library(tidyr)
library(dplyr)
df <- data.frame(
  dt  = seq.POSIXt(from = start_time, length.out = 100, by = 1),
  a_1 = abs(rnorm(100, 1000, 500)),
  b_1 = abs(rnorm(100, 35, 5)),
  c_1 = abs(rnorm(100, 10, 2.5)),
  a_2 = abs(rnorm(100, 1000, 500)),
  b_2 = abs(rnorm(100, 35, 5)),
  c_2 = abs(rnorm(100, 10, 2.5)),
  a_3 = abs(rnorm(100, 1000, 500)),
  b_3 = abs(rnorm(100, 35, 5)),
  c_3 = abs(rnorm(100, 10, 2.5))
)
df_long <- melt(df, "dt") %>%
  separate(variable, c("param", "loc")) %>%
  dcast(dt + loc ~ param)
head(df_long)
dt loc a b c
1 2000-10-01 10:10:10 1 1131.0953 47.29221 10.195120
2 2000-10-01 10:10:10 2 1734.8935 36.09479 9.156366
3 2000-10-01 10:10:10 3 2153.6998 31.95065 8.786107
4 2000-10-01 10:10:11 1 201.1407 34.64221 13.548707
5 2000-10-01 10:10:11 2 1874.0571 40.27503 8.622356
6 2000-10-01 10:10:11 3 867.9888 38.17056 10.339052
Upvotes: 2