Reputation: 2969
A dataset describes multiple repeating measurements for multiple clusters, with each measurement-cluster pair contained in a single column. I would like to wrangle the data into a long(er) format, such that one column provides information on the cluster, but each measurement remains in its own column.
# Current format
df_wider <- data.frame(
id = 1:5,
fruit_1 = sample(fruit, size = 5),
date_1 = sample(seq(as.Date('2020/01/01'), as.Date('2020/05/01'), by="day"), 5),
number_1 = sample(1:100, 5),
fruit_2 = sample(fruit, size = 5),
date_2 = sample(seq(as.Date('2020/01/01'), as.Date('2020/05/01'), by="day"), 5),
number_2 = sample(1:100, 5),
fruit_3 = sample(fruit, size = 5),
date_3 = sample(seq(as.Date('2020/01/01'), as.Date('2020/05/01'), by="day"), 5),
number_3 = sample(1:100, 5)
)
# Desired format
df_longer <- data.frame(
id = rep(1:5, each = 3),
cluster = rep(1:3, 5),
fruit = sample(fruit, size = 15),
date = sample(seq(as.Date('2020/01/01'), as.Date('2020/05/01'), by="day"), 15),
number = sample(1:100, 15)
)
The real dataset contains up to 25 clusters of 100s of measurements each. I attempted to use tidyr::gather()
and tidyr::pivot_longer()
iterated over each measurement, but the resulting intermediate dataframes increased exponentially in size. Attempting to do so in a single tidyr::pivot_longer()
step is impossible due to the values' being of different class. I am unable to think of a way to vectorize this up to scale.
Upvotes: 2
Views: 142
Reputation: 887008
We can use melt
from data.table
library(data.table)
melt(setDT(df_wider), measure = patterns('^fruit', '^date', '^number' ),
value.name = c('fruit', 'date', 'number'), variable.name = 'cluster')
# id cluster fruit date number
# 1: 1 1 date 2020-04-16 17
# 2: 2 1 quince 2020-01-27 7
# 3: 3 1 coconut 2020-04-19 33
# 4: 4 1 pomegranate 2020-02-27 55
# 5: 5 1 persimmon 2020-02-20 62
# 6: 1 2 kiwi fruit 2020-01-14 100
# 7: 2 2 cranberry 2020-03-15 97
# 8: 3 2 cucumber 2020-03-16 5
# 9: 4 2 persimmon 2020-03-06 81
#10: 5 2 date 2020-04-17 30
#11: 1 3 apricot 2020-04-13 86
#12: 2 3 banana 2020-04-17 42
#13: 3 3 bilberry 2020-02-23 88
#14: 4 3 blackcurrant 2020-02-25 10
#15: 5 3 raisin 2020-02-09 87
Upvotes: 1
Reputation: 16178
You could do:
library(tidyr)
library(dplyr)
df_wider %>% pivot_longer(-id,
names_pattern = "(.*)_(\\d)",
names_to = c(".value", "cluster"))
# A tibble: 15 x 5
id cluster fruit date number
<int> <chr> <fct> <date> <int>
1 1 1 olive 2020-04-21 50
2 1 2 elderberry 2020-02-23 59
3 1 3 cherimoya 2020-03-07 9
4 2 1 jujube 2020-03-22 88
5 2 2 mandarine 2020-03-06 45
6 2 3 grape 2020-04-23 78
7 3 1 nut 2020-01-26 53
8 3 2 cantaloupe 2020-01-27 70
9 3 3 durian 2020-02-15 39
10 4 1 chili pepper 2020-03-17 60
11 4 2 raisin 2020-04-14 20
12 4 3 cloudberry 2020-03-11 4
13 5 1 honeydew 2020-01-04 81
14 5 2 lime 2020-03-23 53
15 5 3 ugli fruit 2020-01-13 26
Upvotes: 1