Replacing NAs in a dataframe based on a partial string match (in another dataframe) in R

Question

Goal: To change a column of NAs in one dataframe based on a "key" in another dataframe (something like a VLookUp, except only in R)

Given df1 here (For Simplicity's sake, I just have 6 rows. The key I have is 50 rows for 50 states):

Index	State_Name	Abbreviation
1	California	CA
2	Maryland	MD
3	New York	NY
4	Texas	TX
5	Virginia	VA
6	Washington	WA

And given df2 here (This is just an example. The real dataframe I'm working with has a lot more rows) :

Index	State	Article
1	NA	Texas governor, Abbott, signs new abortion bill
2	NA	Effort to recall California governor Newsome loses steam
3	NA	New York governor, Cuomo, accused of manipulating Covid-19 nursing home data
4	NA	Hogan (Maryland, R) announces plans to lift statewide Covid restrictions
5	NA	DC statehood unlikely as Manchin opposes
6	NA	Amazon HQ2 causing housing prices to soar in northern Virginia

Task: To create an R function that loops and reads the state in each df2$Article row; then cross-reference it with df1$State_Name to replace the NAs in df2$State with the respective df1$Abbreviation key based on the state in df2$Article. I know it's quite a mouthful. I'm stuck with how to start, and finish this puzzle. Hard-coding is not an option as the real datasheet I have have thousands of rows like this, and will update as we add more articles to text-scrape.

The output should look like:

Index	State	Article
1	TX	Texas governor, Abbott, signs new abortion bill
2	CA	Effort to recall California governor Newsome loses steam
3	NY	New York governor, Cuomo, accused of manipulating Covid-19 nursing home data
4	MD	Hogan (Maryland, R) announces plans to lift statewide Covid restrictions
5	NA	DC statehood unlikely as Manchin opposes
6	VA	Amazon HQ2 causing housing prices to soar in northern Virginia

Note: The fifth entry with DC is intended to be NA.

Any links to guides, and/or any advice on how to code this is most appreciated. Thank you!

Ronak Shah · Accepted Answer

You can create create a regex pattern from the State_Name and use str_extract to extract it from Article. Use match to get the corresponding Abbreviation name from df1.

library(stringr)

df2$State <- df1$Abbreviation[match(str_extract(df2$Article, 
               str_c(df1$State_Name, collapse = '|')), df1$State_Name)]
df2$State
#[1] "TX" "CA" "NY" "MD" NA   "VA"

You can also use inbuilt state.name and state.abb instead of df1 to get state name and abbreviations.

Here's a way to do this in for loop -

for(i in seq(nrow(df1))) {
  inds <- grep(df1$State_Name[i], df2$Article)
  if(length(inds)) df2$State[inds] <- df1$Abbreviation[i]
}
df2

#  Index State                                                                      Article
#1     1    TX                              Texas governor, Abbott, signs new abortion bill
#2     2    CA                     Effort to recall California governor Newsome loses steam
#3     3    NY New York governor, Cuomo, accused of manipulating Covid-19 nursing home data
#4     4    MD     Hogan (Maryland, R) announces plans to lift statewide Covid restrictions
#5     5                                       DC statehood unlikely as Manchin opposes
#6     6    VA               Amazon HQ2 causing housing prices to soar in northern Virginia

Replacing NAs in a dataframe based on a partial string match (in another dataframe) in R

Answers (2)

Related Questions