Elsybelsy
Elsybelsy

Reputation: 71

Removing rows from a dataframe based on (exclusive) partial match to another dataframe

I have one dataframe A that contains 2 columns - strings and codes:

String             Code
Bacteria.111       abc1
Bacteria.111       abc2
Bacteria.111       abc3
Archaea.153        gmt1
Archaea.153        gmt2

I have another dataframe B which contains a list of what code the number following bacteria/archaea above should be:

Number             Code
111                abc2
153                gmt1

What i'm trying to do is remove from dataframe A the rows which have the incorrect code for the number after bacteria/archaea. So e.g. the resulting dataframe from this should read:

String             Code
Bacteria.111       abc2
Archaea.153        gmt1

I've tried using %in% but this uses an exact match so nothing matches. All I need is a match for the number which is always at the end of the string and always follows a period. I'm lost on how to accomplish this.

Upvotes: 1

Views: 129

Answers (2)

GKi
GKi

Reputation: 39747

You can use %in% when you paste the columns and use sub to remove everything before the ..

A[paste(sub(".*\\.", "", A$String), A$Code) %in% paste(B$Number, B$Code),]
#        String Code
#2 Bacteria.111 abc2
#4  Archaea.153 gmt1

Upvotes: 2

xwhitelight
xwhitelight

Reputation: 1579

Try this:

A <- A %>%
  mutate(
    Number = as.integer(stringr::str_sub(String, start = stringi::stri_locate_last_fixed(String, ".")[,2] + 1)) # get the code
  ) %>% 
  left_join(B %>% rename(Code2 = Code), by = "Number") %>% # join data frames
  filter(Code == Code2) %>% # compare and filter
  select(-Number, -Code2)

Upvotes: 1

Related Questions