How to remove rows with NAs in a specific column but only if they have a duplicate in another column?

Question

So I have this data frame, in which some rows are duplicated in the "Species" column, and some aren't. There are NAs in all types of rows, both duplicated and not duplicated.

Species            |  A  |  B  |  
--------------------------------
Tilapia guineensis |  1  |  10 |  
Tilapia guineensis |  1  |  NA |   
Tilapia zillii     |  3  |  23 |  
Tilapia zillii     |  3  |  NA |  
Eutrigla gurnardus | 18  |  4  |
Caramila artida    |  9  |  NA |  
Sprattus sprattus  |  7  |  6  |
Spalili burcant    | 11  |  NA |

I want to remove that rows that have NAs in the B column, but only if they belong to a duplicate row. My output would be something like this:

Species            |  A  |  B  |  
--------------------------------
Tilapia guineensis |  1  |  10 |    
Tilapia zillii     |  3  |  23 |    
Eutrigla gurnardus | 18  |  4  |
Caramila artida    |  9  |  NA |  
Sprattus sprattus  |  7  |  6  |
Spalili burcant    | 11  |  NA |

Basically, if a row is duplicated in the Species column and has a NA in the B column, I want to remove that row with the NA. However, if a row is unique in the Species column, I want to keep it even if it has NA.

Sorry if I'm being confusing, thanks in advance.

Reproducible data format:

df <- read.csv(text="
Species,A,B
Tilapia guineensis,1,10
Tilapia guineensis,1,NA
Tilapia zillii,3,23
Tilapia zillii,3,NA
Eutrigla gurnardus,18,4
Caramila artida,9,NA
Sprattus sprattus,7,6
Spalili burcant,11,NA")

Ronak Shah · Accepted Answer

You can select a row if there is only 1 row for the Species or if it is not an NA value.

library(dplyr)
df %>% group_by(Species) %>% filter(n() == 1 | !is.na(B))

#  Species                A     B
#                 
#1 Tilapia guineensis     1    10
#2 Tilapia zillii         3    23
#3 Eutrigla gurnardus    18     4
#4 Caramila artida        9    NA
#5 Sprattus sprattus      7     6
#6 Spalili burcant       11    NA

You can also write the same logic in base R and data.table :

#Base R
subset(df, ave(!is.na(B), Species, FUN = function(x) length(x) == 1 | x))

#data.table
library(data.table)
setDT(df)[, .SD[.N == 1 | !is.na(B)], Species]

data

df <- structure(list(Species = c("Tilapia guineensis", "Tilapia guineensis", 
"Tilapia zillii", "Tilapia zillii", "Eutrigla gurnardus", "Caramila artida", 
"Sprattus sprattus", "Spalili burcant"), A = c(1L, 1L, 3L, 3L, 
18L, 9L, 7L, 11L), B = c(10L, NA, 23L, NA, 4L, NA, 6L, NA)), row.names = c(NA, 
-8L), class = "data.frame")

How to remove rows with NAs in a specific column but only if they have a duplicate in another column?

Answers (2)

Related Questions