Reputation: 507
I have a list of country names and a dataframe containing one column of text and one column of binary indicators.
MWE:
rm(list=ls())
library(countrycode)
country_list <- countrycode::codelist$country.name.en
Text <- c("This is","a test to", "find country", "names like Algeria", "Albania and Afghanistan","in the data","and return only the","first match in each","string, Algeria and Albania", "not Afghanistan")
df <- as.data.frame(Text)
df$ofInterest <- c(0,0,0,1,1,1,0,0,1,0)
I want to return the first word (and only the first word) in df$Text that matches any element in country_list. In other words, I'm only interested in the very first country name that gets mentioned.
The operation should create a new column in df indicating the matched country name, or NA if no matches from country_list were found, for each row.
To make things faster, I also want to restrict the search to rows where df$ofInterest==1.
In other words, it should return the following:
Text                         ofInterest  Match
This is                      0           NA
a test to                    0           NA
find country                 0           NA
names like Algeria           1           Algeria
Albania and Afghanistan      1           Albania
in the data                  1           NA
and return only the          0           NA
first match in each          0           NA
string, Algeria and Albania  1           Algeria
not Afghanistan              0           NA
My problem is that I don't know how to use regex while also pattern matching from a list. How can I do this in R?
This is as far as I got. The "xxxxx" is presumably where the country_list elements should go:
df$Match <- ifelse(str_extract(df$Text, "(?<=^| )xxxxx.*?(?=$| )") %in% country_list, str_extract(df$Text, "(?<=^| )xxxxx.*?(?=$| )"), NA)
This is probably a simple problem, but I couldn't find the solution. Thank you for any help!
Upvotes: 3
Views: 81
Reputation: 25323
Another possible solution, based on intersect with country_list: split each phrase into separate words, then take the first element of the intersection:
library(tidyverse)
library(countrycode)
df %>%
  rowwise %>%
  mutate(Match = if_else(ofInterest == 1,
                         intersect(unlist(str_split(Text, "\\s")), country_list)[1],
                         NA_character_)) %>%
  ungroup
#> # A tibble: 10 × 3
#> Text ofInterest Match
#> <chr> <dbl> <chr>
#> 1 This is 0 <NA>
#> 2 a test to 0 <NA>
#> 3 find country 0 <NA>
#> 4 names like Algeria 1 Algeria
#> 5 Albania and Afghanistan 1 Albania
#> 6 in the data 1 <NA>
#> 7 and return only the 0 <NA>
#> 8 first match in each 0 <NA>
#> 9 string, Algeria and Albania 1 Algeria
#> 10 not Afghanistan 0 <NA>
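To see why taking the first element of the intersection gives the first country mentioned, here is a minimal base R sketch. The helper first_country and the three-element countries vector are hypothetical stand-ins for illustration (the real list comes from countrycode). It relies on intersect() returning its result in the order of its first argument, so the words of the text decide the order.

```r
# Hypothetical stand-in for country_list (the real one comes from countrycode)
countries <- c("Afghanistan", "Albania", "Algeria")

# Hypothetical helper: split on whitespace, keep the first word that is a country
first_country <- function(text, countries) {
  words <- unlist(strsplit(text, "\\s+"))
  # intersect() keeps the order of its first argument, i.e. the order in the text;
  # indexing an empty result with [1] yields NA
  intersect(words, countries)[1]
}

first_country("Albania and Afghanistan", countries)  # "Albania"
first_country("in the data", countries)              # NA
first_country("Algeria, then Albania", countries)    # "Albania" ("Algeria," keeps its comma)
```

As the last call suggests, splitting on whitespace only matches whole tokens, so a name with attached punctuation or a multi-word name would likely be missed with this approach.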
Upvotes: 1
Reputation: 626851
You can use
library(stringr)  # provides str_extract()
df$Match <- str_extract(df$Text, paste0("(?i)\\b(", paste(country_list, collapse="|"), ")\\b"))
df <- within(df, Match[ofInterest == '0'] <- NA)
# > df
# Text ofInterest Match
# 1 This is 0 <NA>
# 2 a test to 0 <NA>
# 3 find country 0 <NA>
# 4 names like Algeria 1 Algeria
# 5 Albania and Afghanistan 1 Albania
# 6 in the data 1 <NA>
# 7 and return only the 0 <NA>
# 8 first match in each 0 <NA>
# 9 string, Algeria and Albania 1 Algeria
# 10 not Afghanistan 0 <NA>
Here, paste0("(?i)\\b(", paste(country_list, collapse="|"), ")\\b") will create a pattern consisting of:
(?i) - case insensitive matching
\b - a word boundary
( - start of a capturing group
paste(country_list, collapse="|") - results in a |-separated list of country names, like Albania|Poland|France etc.
) - end of the group
\b - a word boundary.
The df <- within(df, Match[ofInterest == '0'] <- NA) line then sets Match back to NA in all rows where the ofInterest column value is 0.
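If the goal is to avoid searching rows that are not of interest at all, one option is to subset first and only then run the regex. The following is a minimal base R sketch of that idea, not a drop-in replacement: the short countries vector and the four-row df are hypothetical stand-ins, and regexpr()/regmatches() are used here in place of str_extract() so the example runs without extra packages.

```r
# Hypothetical stand-ins for country_list and the question's data
countries <- c("Afghanistan", "Albania", "Algeria")
df <- data.frame(
  Text = c("names like Algeria", "Albania and Afghanistan",
           "in the data", "not Afghanistan"),
  ofInterest = c(1, 1, 1, 0)
)

# Same pattern shape as above: word boundary, alternation, word boundary
pattern <- paste0("\\b(", paste(countries, collapse = "|"), ")\\b")

# Only search the rows flagged as of interest
rows <- which(df$ofInterest == 1)
df$Match <- NA_character_

# regexpr() finds the first (leftmost) match per string, -1 if there is none
m <- regexpr(pattern, df$Text[rows], perl = TRUE, ignore.case = TRUE)
hit <- m != -1

# regmatches() returns only the matched substrings, so assign to the hit rows
df$Match[rows[hit]] <- regmatches(df$Text[rows], m)

df$Match  # "Algeria" "Albania" NA NA
```

Whether this is noticeably faster than matching every row and blanking afterwards depends on how many rows have ofInterest equal to 1.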
Upvotes: 3