Reputation: 31
I have a dataframe that contains 10 observation per unique ID and some other variables. Unfortunately, the data has some missing values (NA's) for a specific column that I want to replace by carrying the last observation forward (by means of the na.locf function of the zoo package).
The thing is, I need to perform this function per unique user, so I need to "split" the dataframe per unique ID, and then apply the function, while keeping all other columns. I have tried the following, but this did not work:
df %>%
group_by(ID) %>%
group_split() %>%
lapply(Location, na.locf() %>%
bind_rows
A reproducible example of the dataset:
ID <- rep(1:5, each = 10)
Age <- sample(18:65, 50, replace = TRUE)
Locations <- c("Europe", "Asia", NA)
Location <- sample(Location, 50, replace = TRUE)
df <- data.frame(ID, Age, Location)
Upvotes: 1
Views: 833
Reputation: 397
Using group_by
then na.locf
, with the na.rm=FALSE
option to keep the length of the output:
library(dplyr)
df <- df %>% group_by(ID) %>%
mutate(Location=zoo::na.locf(Location, na.rm=FALSE))
As Martin commented, if the series starts with a missing value for a given ID, it will not be filled. This script does not fail, even when the series starts with a missing. For instance:
df <- tibble(ID=c(1, 1, 2, 2), Location=c(NA,"Europe","Asia",NA))
df %>% group_by(ID) %>% mutate(Location2=zoo::na.locf(Location, na.rm=FALSE))
returns:
# A tibble: 4 x 3
# Groups: ID [2]
ID Location Location2
<dbl> <chr> <chr>
1 1 <NA> <NA>
2 1 Europe Europe
3 2 Asia Asia
4 2 <NA> Asia
Upvotes: 1