jvalenti
jvalenti

Reputation: 640

R - conditional pattern matching using grepl

I have two data frames, like so:

name <- c("joe", "kim", "kerry", "david")
name2 <- c("kim", "david", "joe", "kerry")
school <- c("cambridge", "south carolina", "vermont binghamton", "delaware")
school2 <- c("south carolina", "delaware", "cambridge magdalene", "vermont")

df1 <- data.frame(name, school)
df2 <- data.frame(name2, school2)

What I would like to do is the following:

  1. Search df2$name2 for a match in df1$name.
  2. If a match is found, compare df2$school2 to df1$school from the matching row.
  3. If no match is found for df2$school2 in df1$school, return FALSE in column df2$perfect.match

So for example, since "joe" in df2 matches "joe" in df1, there's a match. However, since the values for "school" in both aren't the same, the would be a column in df2 with a value of FALSE in the third row. Same for 4th row in df2.

I have tried using grep and grepl. I figure grepl would be best, since it returns a logical value. What I tried was:

df2$perfect.match <- ifelse(grepl(paste(df2$name2, collapse = "|"), 
df1$name, fixed = F) & grepl(paste(df2$school2, collapse = "|"), df1$school, fixed = F), "", "FALSE")

however, all I get is this:

  name2             school2 perfect.match
1   kim      south carolina         FALSE
2 david            delaware              
3   joe cambridge magdalene              
4 kerry             vermont 

When my desired result is:

  df2

  name2             school2 perfect.match
1   kim      south carolina         
2 david            delaware              
3   joe cambridge magdalene         FALSE     
4 kerry             vermont         FALSE

If possible, something speedy would be best. The real dataframe are quite large. Thanks.

UPDATE:

I would like to also be able to force the rows that are false to have the same value for df2$school as their corresponding name match in df1$school Like so:

  name2             school2
1   kim      south carolina
2 david            delaware
3   joe           cambridge 
4 kerry   vermont binghamton

Upvotes: 1

Views: 479

Answers (4)

Ronak Shah
Ronak Shah

Reputation: 388862

We can use match and %in%. grepl wouldn't be right here since this is exact matching and not pattern matching.

df2$perfect_match <- df2$school2 %in% df1$school[match(df2$name2, df1$name)]
df2
#  name2             school2 perfect_match
#1   kim      south carolina          TRUE
#2 david            delaware          TRUE
#3   joe cambridge magdalene         FALSE
#4 kerry             vermont         FALSE

Upvotes: 2

dc37
dc37

Reputation: 16178

Using dplyr, you can do:

dfX <- df1 %>%
  bind_rows(.,df2) %>%
  group_by(name) %>%
  distinct(school) %>%
  count(name, name = "perfect.matched") %>% 
  left_join(df2,.,by = 'name') %>%
  mutate(., perfect.matched = ifelse(perfect.matched ==1,"","FALSE"))

And to get the following output:

> dfX
   name              school perfect.matched
1   kim      south carolina                
2 david            delaware                
3   joe cambridge magdalene           FALSE
4 kerry             vermont           FALSE

Upvotes: 2

mnist
mnist

Reputation: 6956

Slightly faster than pasting the columns together:

matches <- df2$name2 %in% df1$name
df2$perfect.match <- df2$school2[matches] %in% df1$school

microbenchmark::microbenchmark(
  v1 = {matches <- df2$name2 %in% df1$name
  df2$perfect.match <- df2$school2[matches] %in% df1$school
  },
  v2 = {df2$perfect.match <- paste(df2$name2, df2$school2) %in% paste(df1$name, df1$school)}
)

Upvotes: 2

Andrew Gustar
Andrew Gustar

Reputation: 18425

You can just do...

df2$perfect.match <- paste(df2$name2, df2$school2) %in% paste(df1$name, df1$school)

df2
  name2             school2 perfect.match
1   kim      south carolina          TRUE
2 david            delaware          TRUE
3   joe cambridge magdalene         FALSE
4 kerry             vermont         FALSE

Upvotes: 3

Related Questions