beddotcom

Reputation: 507

Extracting only first appearance from a list of patterns in R

I have a list of country names and a dataframe containing one column of text and one column of binary indicators.

MWE:

rm(list=ls())

library(stringr)      # provides str_extract(), used in the attempt below
library(countrycode)
country_list <- countrycode::codelist$country.name.en

Text <- c("This is","a test to", "find country", "names like Algeria", "Albania and Afghanistan","in the data","and return only the","first match in each","string, Algeria and Albania", "not Afghanistan")
df <- as.data.frame(Text)
df$ofInterest <- c(0,0,0,1,1,1,0,0,1,0)

I want to return the first word (and only the first word) in df$Text that matches any element in country_list. In other words, I'm only interested in the very first country name that gets mentioned.

The operation should create a new column in df indicating the matched country name, or NA if no matches from country_list were found, for each row.

To make things faster, I also want to restrict the search to rows where df$ofInterest==1.

In other words, it should return the following:

Text                       ofInterest   Match
This is                     0           NA   
a test to                   0           NA
find country                0           NA
names like Algeria          1           Algeria
Albania and Afghanistan     1           Albania
in the data                 1           NA
and return only the         0           NA
first match in each         0           NA
string, Algeria and Albania 1           Algeria
not Afghanistan             0           NA

My problem is that I don't know how to use regex while also pattern matching from a list. How can I do this in R?

This is as far as I got. The "xxxxx" is presumably where the entries of country_list should go.

df$Match <- ifelse(str_extract(df$Text, "(?<=^| )xxxxx.*?(?=$| )") %in% country_list, str_extract(df$Text, "(?<=^| )xxxxx.*?(?=$| )"), NA)

This is probably a simple problem, but I couldn't find the solution. Thank you for any help!

Upvotes: 3

Views: 81

Answers (2)

PaulS

Reputation: 25323

Another possible solution: split each phrase into separate words, intersect those words with country_list, and take the first element of the intersection:

library(tidyverse)
library(countrycode)

df %>% 
  rowwise() %>% 
  mutate(Match = if_else(ofInterest == 1,
                         intersect(unlist(str_split(Text, "\\s")), country_list)[1],
                         NA_character_)) %>%
  ungroup()

#> # A tibble: 10 × 3
#>    Text                        ofInterest Match  
#>    <chr>                            <dbl> <chr>  
#>  1 This is                              0 <NA>   
#>  2 a test to                            0 <NA>   
#>  3 find country                         0 <NA>   
#>  4 names like Algeria                   1 Algeria
#>  5 Albania and Afghanistan              1 Albania
#>  6 in the data                          1 <NA>   
#>  7 and return only the                  0 <NA>   
#>  8 first match in each                  0 <NA>   
#>  9 string, Algeria and Albania          1 Algeria
#> 10 not Afghanistan                      0 <NA>
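
Why this returns the first country mentioned, rather than the first one in country_list, may not be obvious: base R's intersect() keeps the order of its first argument. The sketch below illustrates that with base R only. The vector countries is a made-up stand-in for country_list, and first_country() is a hypothetical helper, not part of any package.

```r
# Made-up stand-in for country_list, deliberately in alphabetical order
countries <- c("Afghanistan", "Albania", "Algeria")

# Hypothetical helper: first whitespace-separated word that is a country
first_country <- function(text) {
  words <- strsplit(text, "\\s+")[[1]]
  # intersect() returns common elements in the order they appear in `words`;
  # indexing an empty result with [1] gives NA
  intersect(words, countries)[1]
}

first_match <- first_country("string, Algeria and Albania")
first_match                        # "Algeria"
no_match <- first_country("in the data")
no_match                           # NA

# Caveat: a word with attached punctuation is not an exact match,
# so "Algeria," is skipped here and "Albania" is returned instead
punct_match <- first_country("Algeria, then Albania")
punct_match                        # "Albania"
```

Because the text is split on whitespace, this approach only finds single-word names that appear without attached punctuation; multi-word entries in country_list cannot match.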

Upvotes: 1

Wiktor Stribiżew

Reputation: 626851

You can use

df$Match <- str_extract(df$Text, paste0("(?i)\\b(", paste(country_list, collapse="|"), ")\\b"))
df <- within(df, Match[ofInterest == '0'] <- NA)
# > df
#                           Text ofInterest   Match
# 1                      This is          0    <NA>
# 2                    a test to          0    <NA>
# 3                 find country          0    <NA>
# 4           names like Algeria          1 Algeria
# 5      Albania and Afghanistan          1 Albania
# 6                  in the data          1    <NA>
# 7          and return only the          0    <NA>
# 8          first match in each          0    <NA>
# 9  string, Algeria and Albania          1 Algeria
# 10             not Afghanistan          0    <NA>

Here, paste0("(?i)\\b(", paste(country_list, collapse="|"), ")\\b") builds a pattern made of the following parts:

  • (?i) - case insensitive matching
  • \b - a word boundary
  • ( - start of a capturing group:
    • paste(country_list, collapse="|") will result in a |-separated list of country names, like Albania|Poland|France etc.
  • ) - end of the group
  • \b - word boundary.
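
As a small illustration of the pattern described above, here is a sketch with a made-up three-element list instead of the full country_list. One addition that is not in the answer: some country names contain regex metacharacters (for example the "." in "St. Lucia"), so this sketch escapes them with str_escape(), which assumes stringr 1.5.0 or later.

```r
library(stringr)

# Made-up short list standing in for country_list
countries <- c("Albania", "Algeria", "St. Lucia")

# Escape metacharacters (assumes stringr >= 1.5.0), then join with "|"
# Resulting regex: (?i)\b(Albania|Algeria|St\. Lucia)\b
pattern <- paste0("(?i)\\b(", paste(str_escape(countries), collapse = "|"), ")\\b")

texts <- c("string, Algeria and Albania", "in the data", "visiting st. lucia")

# str_extract() returns the leftmost match in each string, or NA if none
result <- str_extract(texts, pattern)
result   # "Algeria" NA "st. lucia"
```

Note that with (?i) the extracted value keeps the casing found in the text, not the casing used in the list.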

The df <- within(df, Match[ofInterest == '0'] <- NA) call then sets Match to NA in every row where the ofInterest column value is 0.
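
The question also asks to restrict the search to rows where ofInterest is 1 for speed. A possible variant, sketched here on a made-up three-row data frame and a shortened country list, runs str_extract() only on that subset instead of extracting everywhere and blanking rows afterwards. The names rows and countries are illustrative.

```r
library(stringr)

# Made-up miniature versions of df and country_list
df <- data.frame(
  Text = c("names like Algeria", "not Afghanistan", "in the data"),
  ofInterest = c(1, 0, 1),
  stringsAsFactors = FALSE
)
countries <- c("Afghanistan", "Albania", "Algeria")
pattern <- paste0("(?i)\\b(", paste(countries, collapse = "|"), ")\\b")

# Logical index of the rows worth searching
rows <- df$ofInterest == 1

# Start with NA everywhere, then fill in only the selected rows
df$Match <- NA_character_
df$Match[rows] <- str_extract(df$Text[rows], pattern)

df$Match   # "Algeria" NA NA
```

Whether this is noticeably faster depends on how many rows are skipped; the result is the same as extracting from every row and then resetting the uninteresting ones to NA.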

Upvotes: 3
