mkpcr
mkpcr

Reputation: 511

R: Replace missing values in one row based on unique ID for all variables in dataframe

I'm working with a "long" dataframe in which unique participants have multiple observations over time. Some rows are missing data across multiple columns, and those missing records need to be filled with the same data for each participant.

My data set looks like:

list(
ID = c("A23", "A23", "A24", "A24", "A24", "A50", "A50", "A60", "A60"), 
A = c("Y", "Y", "N", "N", NA, "Y", NA, "N", "N"), 
B = c(2.4, 2.4, 2.6, 2.6, NA, 2.2, NA, 3.3, 3.3), 
C = c("M", "M", "F", "F", NA, "F", NA, "M", "M"), 
D = c(1, 1, 0, 0, NA, 1, NA, 1, 1), 
E = c("2015-02-16", "2017-10-05", "2014-03-26", "2016-09-07", "2018-10-08", "2015-02-17", "2017-10-06", "2014-03-27", "2016-09-08"))

I would like to be able to simply copy the data from the complete/non-missing rows and paste it into the missing records based on the ID variable, across the entire dataframe.

I have tried the following methods to no avail (no missing values replaced):

library(tidyverse)
library(zoo)
library(plyr)

# Attempt 1:
data %>%
  group_by(ID) %>%
  mutate_all(~replace(., is.na(.), .[!is.na(.)][1]))

# Attempt 2:
plyr::ddply(data, ~ID, na.locf)

I've found solutions for cases in which missing values of a single column need to be replaced, but I have not yet found a solution for replacing missing values across an entire dataframe.

Thanks in advance.

Upvotes: 1

Views: 826

Answers (2)

Mitchell Graham
Mitchell Graham

Reputation: 155

fill() from tidyr should work:

df %>%
  group_by(ID) %>%
  fill(names(df))

# A tibble: 9 x 6
# Groups:   ID [4]
  ID    A         B C         D E         
  <chr> <chr> <dbl> <chr> <dbl> <chr>     
1 A23   Y       2.4 M         1 2015-02-16
2 A23   Y       2.4 M         1 2017-10-05
3 A24   N       2.6 F         0 2014-03-26
4 A24   N       2.6 F         0 2016-09-07
5 A24   N       2.6 F         0 2018-10-08
6 A50   Y       2.2 F         1 2015-02-17
7 A50   Y       2.2 F         1 2017-10-06
8 A60   N       3.3 M         1 2014-03-27
9 A60   N       3.3 M         1 2016-09-08

Upvotes: 1

Robert Wilson
Robert Wilson

Reputation: 3417

You can solve this using the group_by/mutate method in dplyr:

library(dplyr)
df <- tibble(
  ID = c("A23", "A23", "A24", "A24", "A24", "A50", "A50", "A60", "A60"), 
  A = c("Y", "Y", "N", "N", NA, "Y", NA, "N", "N"), 
  B = c(2.4, 2.4, 2.6, 2.6, NA, 2.2, NA, 3.3, 3.3), 
  C = c("M", "M", "F", "F", NA, "F", NA, "M", "M"), 
  D = c(1, 1, 0, 0, NA, 1, NA, 1, 1), 
  E = c("2015-02-16", "2017-10-05", "2014-03-26", "2016-09-07", "2018-10-08", 
 "2015-02-17", "2017-10-06", "2014-03-27", "2016-09-08"))



fix_nas <- function(x){
  x[complete.cases(x)][1]

}

df %>% 
  group_by(ID) %>% 
  mutate_at(2:5, fix_nas) 

Upvotes: 1

Related Questions