tadeufontes
tadeufontes

Reputation: 477

How to remove rows with NAs in a specific column but only if they have a duplicate in another column?

So I have this data frame, in which some rows are duplicated in the "Species" column, and some aren't. There are NAs in all types of rows, both duplicated and not duplicated.

Species            |  A  |  B  |  
--------------------------------
Tilapia guineensis |  1  |  10 |  
Tilapia guineensis |  1  |  NA |   
Tilapia zillii     |  3  |  23 |  
Tilapia zillii     |  3  |  NA |  
Eutrigla gurnardus | 18  |  4  |
Caramila artida    |  9  |  NA |  
Sprattus sprattus  |  7  |  6  |
Spalili burcant    | 11  |  NA |

I want to remove that rows that have NAs in the B column, but only if they belong to a duplicate row. My output would be something like this:

Species            |  A  |  B  |  
--------------------------------
Tilapia guineensis |  1  |  10 |    
Tilapia zillii     |  3  |  23 |    
Eutrigla gurnardus | 18  |  4  |
Caramila artida    |  9  |  NA |  
Sprattus sprattus  |  7  |  6  |
Spalili burcant    | 11  |  NA |

Basically, if a row is duplicated in the Species column and has a NA in the B column, I want to remove that row with the NA. However, if a row is unique in the Species column, I want to keep it even if it has NA.

Sorry if I'm being confusing, thanks in advance.

Reproducible data format:

df <- read.csv(text="
Species,A,B
Tilapia guineensis,1,10
Tilapia guineensis,1,NA
Tilapia zillii,3,23
Tilapia zillii,3,NA
Eutrigla gurnardus,18,4
Caramila artida,9,NA
Sprattus sprattus,7,6
Spalili burcant,11,NA")

Upvotes: 3

Views: 175

Answers (2)

thelatemail
thelatemail

Reputation: 93813

I think you can avoid using any grouping logic, and get the result in a single pass:

df[!(duplicated(df$Species) & is.na(df$B)),]
#             Species  A  B
#1 Tilapia guineensis  1 10
#3     Tilapia zillii  3 23
#5 Eutrigla gurnardus 18  4
#6    Caramila artida  9 NA
#7  Sprattus sprattus  7  6
#8    Spalili burcant 11 NA

Upvotes: 6

Ronak Shah
Ronak Shah

Reputation: 388817

You can select a row if there is only 1 row for the Species or if it is not an NA value.

library(dplyr)
df %>% group_by(Species) %>% filter(n() == 1 | !is.na(B))

#  Species                A     B
#  <chr>              <int> <int>
#1 Tilapia guineensis     1    10
#2 Tilapia zillii         3    23
#3 Eutrigla gurnardus    18     4
#4 Caramila artida        9    NA
#5 Sprattus sprattus      7     6
#6 Spalili burcant       11    NA

You can also write the same logic in base R and data.table :

#Base R
subset(df, ave(!is.na(B), Species, FUN = function(x) length(x) == 1 | x))

#data.table
library(data.table)
setDT(df)[, .SD[.N == 1 | !is.na(B)], Species]

data

df <- structure(list(Species = c("Tilapia guineensis", "Tilapia guineensis", 
"Tilapia zillii", "Tilapia zillii", "Eutrigla gurnardus", "Caramila artida", 
"Sprattus sprattus", "Spalili burcant"), A = c(1L, 1L, 3L, 3L, 
18L, 9L, 7L, 11L), B = c(10L, NA, 23L, NA, 4L, NA, 6L, NA)), row.names = c(NA, 
-8L), class = "data.frame")

Upvotes: 5

Related Questions