Reputation: 1211
I have a table with columns id
, colA
, and colB
. The data contains duplicated id columns where for some rows, colA
or colB
is null, but its duplicated id
has valid values. I want to clean the data so that I remove duplicates, but have complete data. For example my data looks like
id | colA | colB
1 NA X
1 Y X
2 Z NA
2 Z Y
3 Z Y
3 Z Y
4 NA NA
4 NA NA
and I want my dataframe to look like
id | colA | colB
1 Y X
2 Z Y
3 Z Y
4 NA NA
I usually use the ifelse
statement to replace missing values, but I am confused on how to use this in the context of having duplicated id
s.
Upvotes: 0
Views: 52
Reputation: 11140
First add a column that tells how many NA
s in each row. Then using dplyr
, remove duplicated rows first and then for each id keep the row with least missing values -
df$test <- rowSums(is.na(df))
df %>%
filter(!duplicated(.)) %>%
arrange(id, test) %>%
group_by(id) %>%
filter(row_number() == 1) %>%
ungroup() %>%
select(-test)
# A tibble: 4 x 3
id colA colB
<int> <chr> <chr>
1 1 y x
2 2 z y
3 3 z y
4 4 <NA> <NA>
EDIT: Actually no need to remove duplicates first. Just keeping the row with least missing values for each id should also work -
df$test <- rowSums(is.na(df))
df %>%
arrange(id, test) %>%
group_by(id) %>%
filter(row_number() == 1) %>%
ungroup() %>%
select(-test)
Data -
df <- data.frame(
id = c(rep(seq(1:4), each =2)), colA = c(NA, "y", "z", "z", "z", "z", NA, NA),
colB = c("x", "x", NA, "y", "y", "y", NA, NA), stringsAsFactors = F)
Upvotes: 4
Reputation: 33772
This answer is very dependent on your actual data being similar in structure to your example data.
Your data:
df1 <- structure(list(id = c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L),
colA = c(NA, "Y", "Z", "Z", "Z", "Z", NA, NA),
colB = c("X", "X", NA, "Y", "Y", "Y", NA, NA)),
class = "data.frame",
row.names = c(NA, -8L))
Assuming, as in your example, that each id
occurs twice and that where one observation is NA
, it is the first observation for that id
, then this works:
library(dplyr)
library(tidyr)
df1 %>%
group_by(id) %>%
fill(colA, colB, .direction = "up") %>%
ungroup() %>%
distinct()
# A tibble: 4 x 3
id colA colB
<int> <chr> <chr>
1 1 Y X
2 2 Z Y
3 3 Z Y
4 4 NA NA
If the second observation for an id
can be NA
, you could try adding a second fill
after the first one, but this time fill down:
df1 %>%
group_by(id) %>%
fill(colA, colB, .direction = "up") %>%
fill(colA, colB, .direction = "down") %>%
ungroup() %>%
distinct()
Upvotes: 2
Reputation: 685
Creating dataframe - it helps if you post the code to make the sample data
df <- data.frame(id = c(rep(seq(1:4), each =2)), colA = c(NA, "y", "z", "z", "z", "z", NA, NA), colB = c("x", "x", NA, "y", "y", "y", NA, NA))
Removing rows with single NAs
for(i in 1:nrow(df)){
if(is.na(df[i,]$colA) & !is.na(df[i,]$colB) | !is.na(df[i,]$colA) & is.na(df[i,]$colB)){
df <- df[-i,]
}
}
Removing remaining duplicates (i.e. double NA rows)
df <- df[!duplicated(df), ]
Output
df
Probably a more computationally efficient way of doing this but this ought to work.
Upvotes: 1