stixmcvix
stixmcvix

Reputation: 333

Multiple gathering in R to create tidy dataset

I have a complicated untidy dataset which a dummy version of can be replicated below.

studentID <- seq(1:250)
score2018 <- runif(250)
score2019 <- runif(250)
score2020 <- runif(250)
payment2018 <- runif(250, min=10000, max=12000)
payment2019 <- runif(250, min=11000, max=13000)
payment2020 <- runif(250, min=12000, max=14000)
attendance2018 <- runif(250, min=0.75, max=1)
attendance2019 <- runif(250, min=0.75, max=1)
attendance2020 <- runif(250, min=0.75, max=1)

untidy_df <- data.frame(studentID, score2018, score2019, score2020, payment2018, payment2019, payment2020, attendance2018, attendance2019, attendance2020)

I would like to gather this data frame so that we only have 5 columns: studentID, year, score, payment, attendance. I know how to gather at a basic level, but I have 3 sets to gather here, and I can't see how to do this in one go.

Thanks in advance!

Upvotes: 1

Views: 53

Answers (3)

Edward
Edward

Reputation: 18543

Using pure R:

tidy_df <- reshape(untidy_df, direction="long", idvar="studentID", varying=2:10, sep="")
head(tidy_df)

       studentID time      score  payment attendance
1.2018         1 2018 0.86743970 10995.45  0.9473540
2.2018         2 2018 0.53204701 11152.74  0.8167776
3.2018         3 2018 0.90072918 10631.06  0.9335316
4.2018         4 2018 0.89154492 11889.23  0.9098399
5.2018         5 2018 0.06320442 10973.20  0.8118909
6.2018         6 2018 0.67519166 11751.67  0.8328860

If you want "year" instead of the default "time", add timevar="year"

Upvotes: 2

dario
dario

Reputation: 6485

We could try:

library(dplyr)
library(tidyr)

untidy_df %>% 
  pivot_longer(cols = -studentID) %>% 
  separate(col = name, sep = "(?<=\\D)(?=\\d)|(?<=\\d)(?=\\D)", into = c("measure", "year")) %>% 
  pivot_wider(names_from = measure, values_from = value )

Which returns:

  studentID year  score payment attendance
      <int> <chr> <dbl>   <dbl>      <dbl>
1         1 2018  0.807  10179.      0.974
2         1 2019  0.599  11601.      0.785
3         1 2020  0.515  12347.      0.760
4         2 2018  0.474  11154.      0.983
5         2 2019  0.409  11682.      0.864
6         2 2020  0.688  13756.      0.812
7         3 2018  0.509  11746.      0.870
8         3 2019  0.867  12851.      0.801
9         3 2020  0.878  12710.      0.955
10         4 2018  0.621  11165.      0.975

Upvotes: 1

Ben
Ben

Reputation: 30474

With tidyr you can use pivot_longer:

library(tidyr)

untidy_df %>%
  pivot_longer(cols = -studentID, names_to = c(".value", "year"), names_pattern = "(\\w+)(\\d{4})")

Output

# A tibble: 750 x 5
   studentID year    score payment attendance
       <int> <chr>   <dbl>   <dbl>      <dbl>
 1         1 2018  0.432    10762.      0.786
 2         1 2019  0.948    11340.      0.909
 3         1 2020  0.122    12837.      0.944
 4         2 2018  0.422    11515.      0.950
 5         2 2019  0.0639   12968.      0.828
 6         2 2020  0.611    13645.      0.901
 7         3 2018  0.489    11281.      0.784
 8         3 2019  0.00337  12250.      0.753
 9         3 2020  0.711    12898.      0.803
10         4 2018  0.0596   10526.      0.842

Upvotes: 4

Related Questions