llew88
llew88

Reputation: 45

Replacing NAs in a dataframe based on a partial string match (in another dataframe) in R

Goal: To change a column of NAs in one dataframe based on a "key" in another dataframe (something like a VLookUp, except only in R)

Given df1 here (For Simplicity's sake, I just have 6 rows. The key I have is 50 rows for 50 states):

Index State_Name Abbreviation
1 California CA
2 Maryland MD
3 New York NY
4 Texas TX
5 Virginia VA
6 Washington WA

And given df2 here (This is just an example. The real dataframe I'm working with has a lot more rows) :

Index State Article
1 NA Texas governor, Abbott, signs new abortion bill
2 NA Effort to recall California governor Newsome loses steam
3 NA New York governor, Cuomo, accused of manipulating Covid-19 nursing home data
4 NA Hogan (Maryland, R) announces plans to lift statewide Covid restrictions
5 NA DC statehood unlikely as Manchin opposes
6 NA Amazon HQ2 causing housing prices to soar in northern Virginia

Task: To create an R function that loops and reads the state in each df2$Article row; then cross-reference it with df1$State_Name to replace the NAs in df2$State with the respective df1$Abbreviation key based on the state in df2$Article. I know it's quite a mouthful. I'm stuck with how to start, and finish this puzzle. Hard-coding is not an option as the real datasheet I have have thousands of rows like this, and will update as we add more articles to text-scrape.

The output should look like:

Index State Article
1 TX Texas governor, Abbott, signs new abortion bill
2 CA Effort to recall California governor Newsome loses steam
3 NY New York governor, Cuomo, accused of manipulating Covid-19 nursing home data
4 MD Hogan (Maryland, R) announces plans to lift statewide Covid restrictions
5 NA DC statehood unlikely as Manchin opposes
6 VA Amazon HQ2 causing housing prices to soar in northern Virginia

Note: The fifth entry with DC is intended to be NA.

Any links to guides, and/or any advice on how to code this is most appreciated. Thank you!

Upvotes: 2

Views: 226

Answers (2)

Ronak Shah
Ronak Shah

Reputation: 388817

You can create create a regex pattern from the State_Name and use str_extract to extract it from Article. Use match to get the corresponding Abbreviation name from df1.

library(stringr)

df2$State <- df1$Abbreviation[match(str_extract(df2$Article, 
               str_c(df1$State_Name, collapse = '|')), df1$State_Name)]
df2$State
#[1] "TX" "CA" "NY" "MD" NA   "VA"

You can also use inbuilt state.name and state.abb instead of df1 to get state name and abbreviations.


Here's a way to do this in for loop -

for(i in seq(nrow(df1))) {
  inds <- grep(df1$State_Name[i], df2$Article)
  if(length(inds)) df2$State[inds] <- df1$Abbreviation[i]
}
df2

#  Index State                                                                      Article
#1     1    TX                              Texas governor, Abbott, signs new abortion bill
#2     2    CA                     Effort to recall California governor Newsome loses steam
#3     3    NY New York governor, Cuomo, accused of manipulating Covid-19 nursing home data
#4     4    MD     Hogan (Maryland, R) announces plans to lift statewide Covid restrictions
#5     5  <NA>                                     DC statehood unlikely as Manchin opposes
#6     6    VA               Amazon HQ2 causing housing prices to soar in northern Virginia

Upvotes: 4

hello_friend
hello_friend

Reputation: 5788

Not as concise as above but a Base R approach:

# Unlist handling 0 length vectors: list_2_vec => function()
list_2_vec <- function(lst){
  # Coerce 0 length vectors to na values of the appropriate type: 
  # .zero_to_nas => function()
  .zero_to_nas <- function(x){
    if(identical(x, character(0))){
      NA_character_ 
    }else if(identical(x, integer(0))){
      NA_integer_
    }else if(identical(x, numeric(0))){
      NA_real_
    }else if(identical(x, complex(0))){
      NA_complex_
    }else if(identical(x, logical(0))){
      NA
    }else{
      x
    }
  }
  # Unlist cleaned list: res => vector
  res <- unlist(lapply(lst, .zero_to_nas))
  # Explictly define return object: vector => GlobalEnv()
  return(res)
}

# Classify each article as belonging to the appropriate state: 
# clean_df => data.frame
clean_df <- transform(
  df2,
  State = df1$Abbreviation[
    match(
      list_2_vec(
        regmatches(
          Article, 
          gregexpr(
            paste0(df1$State_Name, collapse = "|"), Article
          )
        )
      ),
      df1$State_Name
    )
  ]
)

# Data: 
df1 <- structure(list(Index = 1:6, State_Name = c("California", "Maryland", 
"New York", "Texas", "Virginia", "Washington"), Abbreviation = c("CA", 
"MD", "NY", "TX", "VA", "WA")), class = "data.frame", row.names = c(NA, -6L))

df2 <- structure(list(Index = 1:6, State = c(NA, NA, NA, NA, NA, NA), 
Article = c("Texas governor, Abbott, signs new abortion bill", 
"Effort to recall California governor Newsome loses steam", 
"New York governor, Cuomo, accused of manipulating Covid-19 nursing home data", 
"Hogan (Maryland, R) announces plans to lift statewide Covid restrictions", 
"DC statehood unlikely as Manchin opposes", "Amazon HQ2 causing housing prices to soar in northern Virginia"
)), class = "data.frame", row.names = c(NA, -6L))

Upvotes: 1

Related Questions