Tidying messy data - collecting all observations from a particular statistical unit to one row

Question

I have quite a large dataset in R where all observations from all time-points are assigned as variables (columns). The problem is that for one particular statistical unit there is also one row assigned to each time-point so the data frame looks following:

Unit   TIME    VAR1-time1  VAR2-time1 VAR3-time1 VAR1-time2 VAR2-time2 VAR3-time2 VAR1-time3 VAR2-time3 VAR3-time3     
 1     time1      2.0         Stuff      4.0         NA         NA       NA         NA          NA           NA
 1     time2       NA          NA         NA         4.0       Stuff     2.0        NA          NA           NA
 1     time3       NA          NA         NA         NA         NA       NA         2.0        Stuff         4.0
 2     time1      2.0         Stuff      4.0         NA         NA       NA         NA          NA           NA
 2     time2       NA          NA         NA         4.0       Stuff     2.0        NA          NA           NA
 2     time3       NA          NA         NA         NA         NA       NA         2.0        Stuff         4.0
 3     time1      2.0         Stuff      4.0         NA         NA       NA         NA          NA           NA
 3     time2       NA          NA         NA         4.0       Stuff     2.0        NA          NA           NA
 3     time3       NA          NA         NA         NA         NA       NA         2.0        Stuff         4.0

Data in the rows are stored in correct columns but TIME variable in this case is useless since variables (columns) themselves already contain the information about the timepoint. I would like to get rid of multiple rows for each time point and have all the data from one particular unit on one row.

Is there a convinient way to tidy up this kind of data? Maybe with dplyr or tidyR-package? The data frame has thousands of different variables (columns) with hundreds of different timepoints (rows) for each statistical unit.

df %>% 
group_by(Unit) %>%
summarise_all(mean,na.rm=T)

Seems to do it quite well, still having problems with different types of variables, using mean-function does not play well with variables that are not numeric or logical. The original data frame has many different types of variables, such as: numeric, logical, calendar dates and times, character.

Bas · Accepted Answer

I would use the excellent pivot_longer from tidyr for this:

df %>% 
  pivot_longer(starts_with("VAR"),
               names_to = c(".value", "time"),  # creates column "time" and a column for every "VAR"
               names_pattern = "(VAR.*)-time(.*)",  # extracts relevant parts from column names
               values_drop_na = TRUE) # remove NAs

Note that the TIME column can be dropped because this information is extracted from the column names and stored in the time column.

Result:

   Unit TIME  time   VAR1 VAR2   VAR3
       
1     1 time1 1         2 Stuff     4
2     1 time2 2         4 Stuff     2
3     1 time3 3         2 Stuff     4
4     2 time1 1         2 Stuff     4
5     2 time2 2         4 Stuff     2
6     2 time3 3         2 Stuff     4
7     3 time1 1         2 Stuff     4
8     3 time2 2         4 Stuff     2
9     3 time3 3         2 Stuff     4

Tidying messy data - collecting all observations from a particular statistical unit to one row

Answers (1)

Related Questions