Kevin Sun
Kevin Sun

Reputation: 1211

How to fill in NAs of various columns grouped by duplicated IDs in R

I have a table with columns id, colA, and colB. The data contains duplicated id columns where for some rows, colA or colB is null, but its duplicated id has valid values. I want to clean the data so that I remove duplicates, but have complete data. For example my data looks like

id | colA | colB
 1     NA    X
 1      Y    X
 2      Z    NA
 2      Z    Y
 3      Z    Y
 3      Z    Y
 4     NA    NA
 4     NA    NA

and I want my dataframe to look like

id | colA | colB
1     Y      X
2     Z      Y
3     Z      Y
4     NA     NA

I usually use the ifelse statement to replace missing values, but I am confused on how to use this in the context of having duplicated ids.

Upvotes: 0

Views: 52

Answers (3)

Shree
Shree

Reputation: 11140

First add a column that tells how many NAs in each row. Then using dplyr, remove duplicated rows first and then for each id keep the row with least missing values -

df$test <- rowSums(is.na(df))

df %>% 
  filter(!duplicated(.)) %>% 
  arrange(id, test) %>% 
  group_by(id) %>% 
  filter(row_number() == 1) %>% 
  ungroup() %>% 
  select(-test)

# A tibble: 4 x 3
     id colA  colB 
  <int> <chr> <chr>
1     1 y     x    
2     2 z     y    
3     3 z     y    
4     4 <NA>  <NA> 

EDIT: Actually no need to remove duplicates first. Just keeping the row with least missing values for each id should also work -

df$test <- rowSums(is.na(df))

df %>% 
  arrange(id, test) %>% 
  group_by(id) %>% 
  filter(row_number() == 1) %>% 
  ungroup() %>% 
  select(-test)

Data -

df <- data.frame(
  id = c(rep(seq(1:4), each =2)), colA = c(NA, "y", "z", "z", "z", "z", NA, NA), 
  colB = c("x", "x", NA, "y", "y", "y", NA, NA), stringsAsFactors = F)

Upvotes: 4

neilfws
neilfws

Reputation: 33772

This answer is very dependent on your actual data being similar in structure to your example data.

Your data:

df1 <- structure(list(id = c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L), 
                      colA = c(NA, "Y", "Z", "Z", "Z", "Z", NA, NA), 
                      colB = c("X", "X", NA, "Y", "Y", "Y", NA, NA)), 
                      class = "data.frame", 
                      row.names = c(NA, -8L))

Assuming, as in your example, that each id occurs twice and that where one observation is NA, it is the first observation for that id, then this works:

library(dplyr)
library(tidyr)

df1 %>% 
  group_by(id) %>% 
  fill(colA, colB, .direction = "up") %>%  
  ungroup() %>% 
  distinct()

# A tibble: 4 x 3
     id colA  colB 
  <int> <chr> <chr>
1     1 Y     X    
2     2 Z     Y    
3     3 Z     Y    
4     4 NA    NA 

If the second observation for an id can be NA, you could try adding a second fill after the first one, but this time fill down:

df1 %>% 
  group_by(id) %>% 
  fill(colA, colB, .direction = "up") %>%  
  fill(colA, colB, .direction = "down") %>%  
  ungroup() %>% 
  distinct()

Upvotes: 2

Andr&#233;.B
Andr&#233;.B

Reputation: 685

Creating dataframe - it helps if you post the code to make the sample data

df <- data.frame(id = c(rep(seq(1:4), each =2)), colA = c(NA, "y", "z", "z", "z", "z", NA, NA), colB = c("x", "x", NA, "y", "y", "y", NA, NA))

Removing rows with single NAs

for(i in 1:nrow(df)){

  if(is.na(df[i,]$colA) & !is.na(df[i,]$colB) | !is.na(df[i,]$colA) & is.na(df[i,]$colB)){

df <- df[-i,]

  }
}

Removing remaining duplicates (i.e. double NA rows)

df  <- df[!duplicated(df), ]

Output

df

Probably a more computationally efficient way of doing this but this ought to work.

Upvotes: 1

Related Questions