Data wrangling from wide to long format with multiple repeating columns of different types

Question

A dataset describes multiple repeating measurements for multiple clusters, with each measurement-cluster pair contained in a single column. I would like to wrangle the data into a long(er) format, such that one column provides information on the cluster, but each measurement remains in its own column.

# Current format
df_wider <- data.frame(
  id = 1:5,
  fruit_1 = sample(fruit, size = 5),
  date_1 = sample(seq(as.Date('2020/01/01'), as.Date('2020/05/01'), by="day"), 5),
  number_1 = sample(1:100, 5),
  fruit_2 = sample(fruit, size = 5),
  date_2 = sample(seq(as.Date('2020/01/01'), as.Date('2020/05/01'), by="day"), 5),
  number_2 = sample(1:100, 5),
  fruit_3 = sample(fruit, size = 5),
  date_3 = sample(seq(as.Date('2020/01/01'), as.Date('2020/05/01'), by="day"), 5),
  number_3 = sample(1:100, 5)
)

# Desired format
df_longer <- data.frame(
  id = rep(1:5, each = 3),
  cluster = rep(1:3, 5),
  fruit = sample(fruit, size = 15),
  date = sample(seq(as.Date('2020/01/01'), as.Date('2020/05/01'), by="day"), 15),
  number = sample(1:100, 15)
)

The real dataset contains up to 25 clusters of 100s of measurements each. I attempted to use tidyr::gather() and tidyr::pivot_longer() iterated over each measurement, but the resulting intermediate dataframes increased exponentially in size. Attempting to do so in a single tidyr::pivot_longer() step is impossible due to the values' being of different class. I am unable to think of a way to vectorize this up to scale.

dc37 · Accepted Answer

You could do:

library(tidyr)
library(dplyr)

df_wider %>% pivot_longer(-id, 
                          names_pattern = "(.*)_(\d)", 
                          names_to = c(".value", "cluster"))

# A tibble: 15 x 5
      id cluster fruit        date       number
                     
 1     1 1       olive        2020-04-21     50
 2     1 2       elderberry   2020-02-23     59
 3     1 3       cherimoya    2020-03-07      9
 4     2 1       jujube       2020-03-22     88
 5     2 2       mandarine    2020-03-06     45
 6     2 3       grape        2020-04-23     78
 7     3 1       nut          2020-01-26     53
 8     3 2       cantaloupe   2020-01-27     70
 9     3 3       durian       2020-02-15     39
10     4 1       chili pepper 2020-03-17     60
11     4 2       raisin       2020-04-14     20
12     4 3       cloudberry   2020-03-11      4
13     5 1       honeydew     2020-01-04     81
14     5 2       lime         2020-03-23     53
15     5 3       ugli fruit   2020-01-13     26

Data wrangling from wide to long format with multiple repeating columns of different types

Answers (2)

Related Questions