photosynthesis

Reputation: 95

Gather multiple variables based on column names that specify variable and identifier

I am dealing with datalogger time-series outputs that have common environmental variables (e.g. light, temperature, windspeed) shared within the dataframe across various locations. So, each column is named first by the environmental variable being measured (e.g. "a"), then by its physical location (e.g. "1"), separated by "_".

As an example, we can imagine a dataframe with the environmental variables "a", "b", and "c" being measured simultaneously at three different locations. This gives one column for date-time plus a column for each of the nine unique variable-location combinations, like so:

"dt" "a_1" "a_2" "a_3" "b_1" "b_2" "b_3" "c_1" "c_2" "c_3"

I need to convert the data frame into long format, so that there is a single column each for "dt", "a", "b", and "c", with a new column "loc" for the location associated with each environmental variable measurement.

The code below creates a mock dataframe and then uses a very cumbersome method to create the output I would like. However, this example code is far too cumbersome to use for large dataframes (i.e. dozens of variables and locations).

How could I make this more efficient by using the information in the column names to automatically transform the data, preferably with a tidyverse approach via tidyr and dplyr?

### Mock data:
start_time <- as.POSIXct("2000-10-01 10:10:10")
df <- data.frame(
    dt= seq.POSIXt(from = start_time, length.out = 100, by = 1),
    a_1=abs(rnorm(100, 1000, 500)),
    b_1=abs(rnorm(100, 35, 5)),
    c_1=abs(rnorm(100, 10, 2.5)),
    a_2=abs(rnorm(100, 1000, 500)),
    b_2=abs(rnorm(100, 35, 5)),
    c_2=abs(rnorm(100, 10, 2.5)),
    a_3=abs(rnorm(100, 1000, 500)),
    b_3=abs(rnorm(100, 35, 5)),
    c_3=abs(rnorm(100, 10, 2.5))
)

### New data frames for each location, with location identifier column:
library(dplyr)

loc1 <- df %>%
  select(dt, a_1, b_1, c_1) %>%
  rename(a = a_1) %>%
  rename(b = b_1) %>%
  rename(c = c_1) %>%
  mutate(loc = as.character("1"))

loc2 <- df %>%
  select(dt, a_2, b_2, c_2) %>%
  rename(a = a_2) %>%
  rename(b = b_2) %>%
  rename(c = c_2) %>%
  mutate(loc = as.character("2"))

loc3 <- df %>%
  select(dt, a_3, b_3, c_3) %>%
  rename(a = a_3) %>%
  rename(b = b_3) %>%
  rename(c = c_3) %>%
  mutate(loc = as.character("3"))

### Data in desired long format:
all_data_long <- rbind(loc1, loc2, loc3)

Upvotes: 1

Views: 1409

Answers (3)

user2602640

Reputation: 740

Using the tidyverse approach, as requested, does this work for you?

library(dplyr)
library(tidyr)
out <- df %>% 
  gather(Letter, Val, -dt) %>%                    # long format: one row per dt/column pair
  separate(Letter, into = c("Letter", "Loc")) %>% # split "a_1" into "a" and "1"
  spread(Letter, Val)                             # one column per variable, keeping Loc
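
If you are on a more recent tidyr (1.0.0 or later), the same reshape can be written with pivot_longer; a minimal sketch, assuming the df from the question, where the ".value" sentinel keeps a, b, and c as separate columns:

# pivot_longer equivalent: ".value" takes the part of the name before "_" as the
# output column, "loc" takes the part after "_"
out2 <- df %>%
  pivot_longer(-dt, names_to = c(".value", "loc"), names_sep = "_")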

Upvotes: 3

IceCreamToucan

Reputation: 28685

You can use data.table::melt with patterns for the measure.vars.

Note: As mentioned by @Istrel, the variable column here actually indicates which occurrence of a column matching that pattern the given rows belong to, not the second part of the column name. For example, if the a_* columns were a_1, a_2, a_99, the variable value corresponding to that last column would still be 3, not 99 (a sketch for recovering the real suffixes follows the all.equal check below).

library(data.table)
setDT(df)

all_data_long2 <- melt(df, id.vars = 'dt', 
                       measure.vars = patterns(a = 'a_*', b = 'b_*', c = 'c_*'))

#                       dt variable          a        b         c
#   1: 2000-10-01 10:10:10        1 1181.68131 30.12497  7.733530
#   2: 2000-10-01 10:10:11        1  402.04443 35.97919 11.972216
#   3: 2000-10-01 10:10:12        1 1002.14735 37.94243 10.570481
#   4: 2000-10-01 10:10:13        1  574.04331 30.69238 11.131428
#   5: 2000-10-01 10:10:14        1  221.77960 36.41496  5.349643
#  ---                                                           
# 296: 2000-10-01 10:11:45        3  900.11802 36.16800  8.150693
# 297: 2000-10-01 10:11:46        3  820.79518 34.56636 10.771145
# 298: 2000-10-01 10:11:47        3  825.68334 29.42049 14.811727
# 299: 2000-10-01 10:11:48        3   17.55973 42.44830 14.625586
# 300: 2000-10-01 10:11:49        3  971.93711 37.43062 11.339470

After some small changes, this is the same as your output:

setnames(all_data_long2, 'variable', 'loc')

all_data_long2$loc <- as.character(all_data_long2$loc)

all.equal(all_data_long, 
          all_data_long2[,names(all_data_long), with = F],
          check.attributes = F)
# [1] TRUE
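
If the real suffixes matter (per the note above), one possible workaround is to map the occurrence index back to the actual suffixes taken from the column names; a sketch, assuming every variable shares the same set of suffixes in the same order:

# hedged sketch: recover the real location labels from the a_* column names
# (assumes all variables were logged at the same set of locations, in the same order)
suffixes <- sub('^a_', '', grep('^a_', names(df), value = TRUE))
all_data_long2$loc <- suffixes[as.integer(all_data_long2$loc)]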

A benchmark shows a large relative difference in speed, but neither method takes even a second on this test data, so it won't matter unless your real data is much larger.

f.dt <- function(df){
  setDT(df)
  melt(df, id.vars = 'dt', 
        measure.vars = patterns(a = 'a_*', b = 'b_*', c = 'c_*')) 
}

f.tidy <- function(df){
  df %>% 
    gather(Letter, Val, -dt) %>% 
    separate(Letter, into = c("Letter", "Loc")) %>% 
    spread(Letter, Val)
}

library(microbenchmark)
microbenchmark(f.dt(df), f.tidy(df))
 # Unit: microseconds
 #       expr      min        lq      mean   median       uq      max neval
 #   f.dt(df)  236.308  296.2055  533.9328  379.488  463.590 10763.90   100
 # f.tidy(df) 6816.003 7498.2600 9466.6462 7899.900 9653.132 44043.92   100

Upvotes: 2

Istrel

Reputation: 2588

You can use tidyr's separate function on the melted data frame to split your column names into parameter and location. The next step is dcast to a wider (but still long) format with a separate column for each parameter.

library(reshape2)
library(tidyr)
library(dplyr)

df <- data.frame(
    dt= seq.POSIXt(from = start_time, length.out = 100, by = 1),
    a_1=abs(rnorm(100, 1000, 500)),
    b_1=abs(rnorm(100, 35, 5)),
    c_1=abs(rnorm(100, 10, 2.5)),
    a_2=abs(rnorm(100, 1000, 500)),
    b_2=abs(rnorm(100, 35, 5)),
    c_2=abs(rnorm(100, 10, 2.5)),
    a_3=abs(rnorm(100, 1000, 500)),
    b_3=abs(rnorm(100, 35, 5)),
    c_3=abs(rnorm(100, 10, 2.5))
)

df_long <- melt(df, "dt") %>% 
    separate(variable, c("param", "loc") ) %>% 
    dcast(dt + loc ~ param)

head(df_long)

               dt loc         a        b         c
1 2000-10-01 10:10:10   1 1131.0953 47.29221 10.195120
2 2000-10-01 10:10:10   2 1734.8935 36.09479  9.156366
3 2000-10-01 10:10:10   3 2153.6998 31.95065  8.786107
4 2000-10-01 10:10:11   1  201.1407 34.64221 13.548707
5 2000-10-01 10:10:11   2 1874.0571 40.27503  8.622356
6 2000-10-01 10:10:11   3  867.9888 38.17056 10.339052
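
As a small illustrative follow-up (not part of the original answer), once the data is in this long form, per-location summaries are straightforward with dplyr; a sketch using the df_long from above:

# illustrative only: mean of each variable at each location
df_long %>%
    group_by(loc) %>%
    summarise(a = mean(a), b = mean(b), c = mean(c))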

Upvotes: 2
