Alex
Alex

Reputation: 37

How can I filter rows which contains 2 or more words from another column?

I would like to filter rows that contain 2 or more words located in another column.

I have a dataframe like this:

    df <- data.frame(name1 = c("Carlos Lopez Rey", "Monica Naranjo Garcia", "Antonio Perez Reverte", "Alejandro Martinez Amor", "Iñigo Muruzabal"), 
                     name2 = c("Lopez, Carlos", "Monica de Naranjo", "Garcia, Antonio", "Alejandro Martinez de Amor", "Muruzabal, Javier"))

And I would like to create a condition that filters rows that contain 2 or more same words in the first column (name1) and in the second column (name2). The result I would like to have is:

name1 name2
Carlos Lopez Rey Lopez, Carlos
Monica Naranjo Garcia Monica de Naranjo
Alejandro Martinez Amor Alejandro Martinez de Amor

* Notice that "Antonio Perez Reverte" and " Iñigo Muruzabal" are not filtered because the first column only matches 1 word with the second column.

Upvotes: 1

Views: 148

Answers (3)

TarJae
TarJae

Reputation: 78927

Update: Thanks to Martin Gal's valuable suggestion:

Please consider to merge the twomutate and the filter into one line:

library(dplyr)
library(stringr)

df %>%
filter(str_detect(name2, str_replace_all(name1," ", "|")))

1. Answer: We could create a pattern column with str_replace_all and then flag with str_detect and then filter all rows where string from name1 is in name2:

library(dplyr)
library(stringr)
df %>% 
  mutate(pattern_name1 = str_replace_all(name1," ", "|")) %>% 
  mutate(flag = str_detect(name2, pattern_name1)) %>% 
  filter(flag == TRUE) %>% 
  select(1,2)

output:

             name1                      name2
1        Carlos Lopez Rey              Lopez, Carlos
2   Monica Naranjo Garcia          Monica de Naranjo
3   Antonio Perez Reverte            Garcia, Antonio
4 Alejandro Martinez Amor Alejandro Martinez de Amor
5         Iñigo Muruzabal          Muruzabal, Javier

Upvotes: 1

Ronak Shah
Ronak Shah

Reputation: 388982

Split the string on words, find common words using length(intersect(...)) and select only rows that have at least 2 words in common.

result <- subset(df, mapply(function(x, y) length(intersect(x, y)), 
                     strsplit(name1, ',|\\s+'), strsplit(name2, ',|\\s+')) >= 2)

result

#                    name1                      name2
#1        Carlos Lopez Rey              Lopez, Carlos
#2   Monica Naranjo Garcia          Monica de Naranjo
#4 Alejandro Martinez Amor Alejandro Martinez de Amor

Upvotes: 3

Pedro Alencar
Pedro Alencar

Reputation: 1089

You can solve with a for-loop

Your dataset:

df <- data.frame(name1 = c("Carlos Lopez Rey", "Monica Naranjo Garcia", "Antonio Perez Reverte", "Alejandro Martinez Amor", "Iñigo Muruzabal"), 
                 name2 = c("Lopez, Carlos", "Monica de Naranjo", "Garcia, Antonio", "Alejandro Martinez de Amor", "Muruzabal, Javier"))

Remove commas and split in spaces:

name1 <- strsplit(gsub(",","",df$name1),  ' ')
name2 <- strsplit(gsub(",","",df$name2), ' ')

Now use the loop to find the rows:

rows <- c()
for (i in 1:nrow(df)){
    # i = 1
    if (sum(name1[[i]] %in% name2[[i]]) > 1){
        rows <- append(rows, i)
    }
}

df_select <- df[rows,]

Upvotes: 0

Related Questions