Reputation: 399
I am trying to remove duplicates from a dataset (caused by merging). However, one row contains a value and one does not, in some cases both rows are NA. I want to keep the ones with data, and if there are on NAs, then it does not matter which I keep. How do I do that? I am stuck.
I unsuccessfully tried the solutions from here (also not usually working with data.table, so I dont understand whats what)
R data.table remove rows where one column is duplicated if another column is NA
Some minimum example data:
df <- data.frame(ID = c("A", "A", "B", "B", "C", "D", "E", "G", "H", "J", "J"),
value = c(NA, 1L, NA, NA, 1L, 1L, 1L, 1L, 1L, NA, 1L))
ID value
A NA
A 1
B NA
B NA
C 1
D 1
E 1
G 1
H 1
J NA
J 1
and I want this:
ID value
A 1
B NA
C 1
D 1
E 1
G 1
H 1
J 1
Upvotes: 4
Views: 2102
Reputation: 33488
Here is a relatively simple data.table
solution.
Grouping by ID
if all the values are NA
just take the first value
, if not take all values that are not NA
.
library(data.table)
setDT(df)
df[, if (all(is.na(value))) value[1] else value[!is.na(value)], by = ID]
Upvotes: 4
Reputation: 35554
An alternative of @tmfmnk's answer with slice_max()
in dplyr
.
library(dplyr)
df %>%
group_by(ID) %>%
slice_max(!is.na(value), with_ties = F)
# # A tibble: 8 x 2
# # Groups: ID [8]
# ID value
# <chr> <int>
# 1 A 1
# 2 B NA
# 3 C 1
# 4 D 1
# 5 E 1
# 6 G 1
# 7 H 1
# 8 J 1
Upvotes: 3
Reputation: 39858
One possibility using dplyr
could be:
df %>%
group_by(ID) %>%
slice(which.max(!is.na(value)))
ID value
<chr> <int>
1 A 1
2 B NA
3 C 1
4 D 1
5 E 1
6 G 1
7 H 1
8 J 1
Upvotes: 8