Rami Al-Fahham
Rami Al-Fahham

Reputation: 627

Rowwise extract common substrings from to columns in a data frame

I want to match cities with regions in a data frame. The columns are a little bit messy, so I would like to extract the names of the cities / regions that appear in two columns as in the following example.

A <- c("Berlin",
            "Hamburg",
            "Munich",
            "Stuttgart",
            "Rhein Main Frankfurt",
            "Hannover")

B <- c("Berlin Brandenburg",
             "Hamburg",
             "Munich Bayern",
             "Region Stuttgart",
             "Main Rhein Darmstadt",
             "Wiesbaden")

The resulting column / data frame should look like this:

result <- c("Berlin",
            "Hamburg",
            "Munich",
            "Stuttgart",
            "Rhein Main",
            NA
            )

df <- data.frame(A, B, result)

...while it doesn't matter whether it's "Rhein Main" or "Main Rhein".

Thank you for your help!

Upvotes: 0

Views: 101

Answers (1)

Ronak Shah
Ronak Shah

Reputation: 389235

Maybe I am missing a smart regex trick but one option would be to split strings into words and find the common words using intersect.

df$Result <- mapply(function(x, y) paste0(intersect(x, y), collapse = " "), 
                    strsplit(df$A, '\\s+'), strsplit(df$B, '\\s+'))
df

#                     A                    B     Result
#1               Berlin   Berlin Brandenburg     Berlin
#2              Hamburg              Hamburg    Hamburg
#3               Munich        Munich Bayern     Munich
#4            Stuttgart     Region Stuttgart  Stuttgart
#5 Rhein Main Frankfurt Main Rhein Darmstadt Rhein Main
#6             Hannover            Wiesbaden           

This returns empty string when there is no match. You can turn the empty string to NA if needed.

Upvotes: 1

Related Questions