Reputation: 27
I have quite a large dataset in R where all observations from all time-points are assigned as variables (columns). The problem is that for one particular statistical unit there is also one row assigned to each time-point so the data frame looks following:
Unit TIME VAR1-time1 VAR2-time1 VAR3-time1 VAR1-time2 VAR2-time2 VAR3-time2 VAR1-time3 VAR2-time3 VAR3-time3
1 time1 2.0 Stuff 4.0 NA NA NA NA NA NA
1 time2 NA NA NA 4.0 Stuff 2.0 NA NA NA
1 time3 NA NA NA NA NA NA 2.0 Stuff 4.0
2 time1 2.0 Stuff 4.0 NA NA NA NA NA NA
2 time2 NA NA NA 4.0 Stuff 2.0 NA NA NA
2 time3 NA NA NA NA NA NA 2.0 Stuff 4.0
3 time1 2.0 Stuff 4.0 NA NA NA NA NA NA
3 time2 NA NA NA 4.0 Stuff 2.0 NA NA NA
3 time3 NA NA NA NA NA NA 2.0 Stuff 4.0
Data in the rows are stored in correct columns but TIME variable in this case is useless since variables (columns) themselves already contain the information about the timepoint. I would like to get rid of multiple rows for each time point and have all the data from one particular unit on one row.
Is there a convinient way to tidy up this kind of data? Maybe with dplyr or tidyR-package? The data frame has thousands of different variables (columns) with hundreds of different timepoints (rows) for each statistical unit.
df %>%
group_by(Unit) %>%
summarise_all(mean,na.rm=T)
Seems to do it quite well, still having problems with different types of variables, using mean-function does not play well with variables that are not numeric or logical. The original data frame has many different types of variables, such as: numeric, logical, calendar dates and times, character.
Upvotes: 0
Views: 79
Reputation: 4658
I would use the excellent pivot_longer
from tidyr
for this:
df %>%
pivot_longer(starts_with("VAR"),
names_to = c(".value", "time"), # creates column "time" and a column for every "VAR"
names_pattern = "(VAR.*)-time(.*)", # extracts relevant parts from column names
values_drop_na = TRUE) # remove NAs
Note that the TIME
column can be dropped because this information is extracted from the column names and stored in the time
column.
Result:
Unit TIME time VAR1 VAR2 VAR3
<dbl> <chr> <chr> <dbl> <chr> <dbl>
1 1 time1 1 2 Stuff 4
2 1 time2 2 4 Stuff 2
3 1 time3 3 2 Stuff 4
4 2 time1 1 2 Stuff 4
5 2 time2 2 4 Stuff 2
6 2 time3 3 2 Stuff 4
7 3 time1 1 2 Stuff 4
8 3 time2 2 4 Stuff 2
9 3 time3 3 2 Stuff 4
Upvotes: 1