Reputation: 37
I would like to filter rows that contain 2 or more words located in another column.
I have a dataframe like this:
df <- data.frame(name1 = c("Carlos Lopez Rey", "Monica Naranjo Garcia", "Antonio Perez Reverte", "Alejandro Martinez Amor", "Iñigo Muruzabal"),
name2 = c("Lopez, Carlos", "Monica de Naranjo", "Garcia, Antonio", "Alejandro Martinez de Amor", "Muruzabal, Javier"))
And I would like to create a condition that filters rows that contain 2 or more same words in the first column (name1) and in the second column (name2). The result I would like to have is:
name1 | name2 |
---|---|
Carlos Lopez Rey | Lopez, Carlos |
Monica Naranjo Garcia | Monica de Naranjo |
Alejandro Martinez Amor | Alejandro Martinez de Amor |
* Notice that "Antonio Perez Reverte" and " Iñigo Muruzabal" are not filtered because the first column only matches 1 word with the second column.
Upvotes: 1
Views: 148
Reputation: 78927
Update: Thanks to Martin Gal's valuable suggestion:
Please consider to merge the twomutate
and the filter
into one line:
library(dplyr)
library(stringr)
df %>%
filter(str_detect(name2, str_replace_all(name1," ", "|")))
1. Answer:
We could create a pattern column with str_replace_all
and then flag with str_detect
and then filter
all rows where string from name1
is in name2
:
library(dplyr)
library(stringr)
df %>%
mutate(pattern_name1 = str_replace_all(name1," ", "|")) %>%
mutate(flag = str_detect(name2, pattern_name1)) %>%
filter(flag == TRUE) %>%
select(1,2)
output:
name1 name2
1 Carlos Lopez Rey Lopez, Carlos
2 Monica Naranjo Garcia Monica de Naranjo
3 Antonio Perez Reverte Garcia, Antonio
4 Alejandro Martinez Amor Alejandro Martinez de Amor
5 Iñigo Muruzabal Muruzabal, Javier
Upvotes: 1
Reputation: 388982
Split the string on words, find common words using length(intersect(...))
and select only rows that have at least 2 words in common.
result <- subset(df, mapply(function(x, y) length(intersect(x, y)),
strsplit(name1, ',|\\s+'), strsplit(name2, ',|\\s+')) >= 2)
result
# name1 name2
#1 Carlos Lopez Rey Lopez, Carlos
#2 Monica Naranjo Garcia Monica de Naranjo
#4 Alejandro Martinez Amor Alejandro Martinez de Amor
Upvotes: 3
Reputation: 1089
You can solve with a for-loop
Your dataset:
df <- data.frame(name1 = c("Carlos Lopez Rey", "Monica Naranjo Garcia", "Antonio Perez Reverte", "Alejandro Martinez Amor", "Iñigo Muruzabal"),
name2 = c("Lopez, Carlos", "Monica de Naranjo", "Garcia, Antonio", "Alejandro Martinez de Amor", "Muruzabal, Javier"))
Remove commas and split in spaces:
name1 <- strsplit(gsub(",","",df$name1), ' ')
name2 <- strsplit(gsub(",","",df$name2), ' ')
Now use the loop to find the rows:
rows <- c()
for (i in 1:nrow(df)){
# i = 1
if (sum(name1[[i]] %in% name2[[i]]) > 1){
rows <- append(rows, i)
}
}
df_select <- df[rows,]
Upvotes: 0