Reputation: 45
Goal: To change a column of NAs in one dataframe based on a "key" in another dataframe (something like a VLookUp, except only in R)
Given df1 here (For Simplicity's sake, I just have 6 rows. The key I have is 50 rows for 50 states):
Index | State_Name | Abbreviation |
---|---|---|
1 | California | CA |
2 | Maryland | MD |
3 | New York | NY |
4 | Texas | TX |
5 | Virginia | VA |
6 | Washington | WA |
And given df2 here (This is just an example. The real dataframe I'm working with has a lot more rows) :
Index | State | Article |
---|---|---|
1 | NA | Texas governor, Abbott, signs new abortion bill |
2 | NA | Effort to recall California governor Newsome loses steam |
3 | NA | New York governor, Cuomo, accused of manipulating Covid-19 nursing home data |
4 | NA | Hogan (Maryland, R) announces plans to lift statewide Covid restrictions |
5 | NA | DC statehood unlikely as Manchin opposes |
6 | NA | Amazon HQ2 causing housing prices to soar in northern Virginia |
Task: To create an R function that loops and reads the state in each df2$Article row; then cross-reference it with df1$State_Name to replace the NAs in df2$State with the respective df1$Abbreviation key based on the state in df2$Article. I know it's quite a mouthful. I'm stuck with how to start, and finish this puzzle. Hard-coding is not an option as the real datasheet I have have thousands of rows like this, and will update as we add more articles to text-scrape.
The output should look like:
Index | State | Article |
---|---|---|
1 | TX | Texas governor, Abbott, signs new abortion bill |
2 | CA | Effort to recall California governor Newsome loses steam |
3 | NY | New York governor, Cuomo, accused of manipulating Covid-19 nursing home data |
4 | MD | Hogan (Maryland, R) announces plans to lift statewide Covid restrictions |
5 | NA | DC statehood unlikely as Manchin opposes |
6 | VA | Amazon HQ2 causing housing prices to soar in northern Virginia |
Note: The fifth entry with DC is intended to be NA.
Any links to guides, and/or any advice on how to code this is most appreciated. Thank you!
Upvotes: 2
Views: 226
Reputation: 388817
You can create create a regex pattern from the State_Name
and use str_extract
to extract it from Article
. Use match
to get the corresponding Abbreviation
name from df1
.
library(stringr)
df2$State <- df1$Abbreviation[match(str_extract(df2$Article,
str_c(df1$State_Name, collapse = '|')), df1$State_Name)]
df2$State
#[1] "TX" "CA" "NY" "MD" NA "VA"
You can also use inbuilt state.name
and state.abb
instead of df1
to get state name and abbreviations.
Here's a way to do this in for
loop -
for(i in seq(nrow(df1))) {
inds <- grep(df1$State_Name[i], df2$Article)
if(length(inds)) df2$State[inds] <- df1$Abbreviation[i]
}
df2
# Index State Article
#1 1 TX Texas governor, Abbott, signs new abortion bill
#2 2 CA Effort to recall California governor Newsome loses steam
#3 3 NY New York governor, Cuomo, accused of manipulating Covid-19 nursing home data
#4 4 MD Hogan (Maryland, R) announces plans to lift statewide Covid restrictions
#5 5 <NA> DC statehood unlikely as Manchin opposes
#6 6 VA Amazon HQ2 causing housing prices to soar in northern Virginia
Upvotes: 4
Reputation: 5788
Not as concise as above but a Base R approach:
# Unlist handling 0 length vectors: list_2_vec => function()
list_2_vec <- function(lst){
# Coerce 0 length vectors to na values of the appropriate type:
# .zero_to_nas => function()
.zero_to_nas <- function(x){
if(identical(x, character(0))){
NA_character_
}else if(identical(x, integer(0))){
NA_integer_
}else if(identical(x, numeric(0))){
NA_real_
}else if(identical(x, complex(0))){
NA_complex_
}else if(identical(x, logical(0))){
NA
}else{
x
}
}
# Unlist cleaned list: res => vector
res <- unlist(lapply(lst, .zero_to_nas))
# Explictly define return object: vector => GlobalEnv()
return(res)
}
# Classify each article as belonging to the appropriate state:
# clean_df => data.frame
clean_df <- transform(
df2,
State = df1$Abbreviation[
match(
list_2_vec(
regmatches(
Article,
gregexpr(
paste0(df1$State_Name, collapse = "|"), Article
)
)
),
df1$State_Name
)
]
)
# Data:
df1 <- structure(list(Index = 1:6, State_Name = c("California", "Maryland",
"New York", "Texas", "Virginia", "Washington"), Abbreviation = c("CA",
"MD", "NY", "TX", "VA", "WA")), class = "data.frame", row.names = c(NA, -6L))
df2 <- structure(list(Index = 1:6, State = c(NA, NA, NA, NA, NA, NA),
Article = c("Texas governor, Abbott, signs new abortion bill",
"Effort to recall California governor Newsome loses steam",
"New York governor, Cuomo, accused of manipulating Covid-19 nursing home data",
"Hogan (Maryland, R) announces plans to lift statewide Covid restrictions",
"DC statehood unlikely as Manchin opposes", "Amazon HQ2 causing housing prices to soar in northern Virginia"
)), class = "data.frame", row.names = c(NA, -6L))
Upvotes: 1