JosvdBerg
JosvdBerg

Reputation: 31

Apply function based on value in other column

I have a dataframe that contains 10 observation per unique ID and some other variables. Unfortunately, the data has some missing values (NA's) for a specific column that I want to replace by carrying the last observation forward (by means of the na.locf function of the zoo package).

The thing is, I need to perform this function per unique user, so I need to "split" the dataframe per unique ID, and then apply the function, while keeping all other columns. I have tried the following, but this did not work:

df %>%
  group_by(ID) %>%
  group_split() %>%
  lapply(Location, na.locf() %>%
  bind_rows

A reproducible example of the dataset:

ID <- rep(1:5, each = 10)
Age <- sample(18:65, 50, replace = TRUE)
Locations <- c("Europe", "Asia", NA)
Location <- sample(Location, 50, replace = TRUE)
df <- data.frame(ID, Age, Location)

Upvotes: 1

Views: 833

Answers (1)

Roland
Roland

Reputation: 397

Using group_by then na.locf, with the na.rm=FALSE option to keep the length of the output:

library(dplyr)
df <- df %>% group_by(ID) %>% 
mutate(Location=zoo::na.locf(Location, na.rm=FALSE))

As Martin commented, if the series starts with a missing value for a given ID, it will not be filled. This script does not fail, even when the series starts with a missing. For instance:

df <- tibble(ID=c(1, 1, 2, 2), Location=c(NA,"Europe","Asia",NA))
df %>% group_by(ID) %>% mutate(Location2=zoo::na.locf(Location, na.rm=FALSE))

returns:

# A tibble: 4 x 3
# Groups:   ID [2]
     ID Location Location2
  <dbl> <chr>    <chr>    
1     1 <NA>     <NA>     
2     1 Europe   Europe   
3     2 Asia     Asia     
4     2 <NA>     Asia     

Upvotes: 1

Related Questions