Reputation: 31
I have two data frames:
DF1: One string column of 200 rows
DF2: Two columns of 130 rows:
The first column has a substring contained in DF1$messy
The second column has another string with the proper nomenclature.
Looks like this
DF1:
-----------------
| messy |
| abc.'123_c |
| def.'456_c |
| hij.'789_c |
DF2:
-----------------
| old_str | new_str |
| 123 | aa |
| 789 | cc |
I need a way to bring in the field DF2$new_str into DF1, by matching the field DF1$messy to the substring DF2$old_str
I first tried writing a for loop, expecting that the the function would apply over the entirety of the columns:
df1$new.str <- 0
correct_field <-
for (i in 1:nrow(df1)){
IF (df1$messy[i] == df2$old_str)
df1$clean[i] = df2$new_str
}
Second I tried the difference_left_join
function:
library(dplyr)
library(fuzzyjoin)
test<-difference_left_join(df1, df2, by = c(cut = 'Data.Lake.Field' )
(I renamed the join column in both sets 'Data.Lake.Field')
Desired Output:
-----------------
| messy | new_str |
| abc.'123_c | aa |
| def.'456_c | NULL |
| hij.'789_c | cc |
EDIT: I think, theoretically, the best was to tackle this would be a FOR loop, with a new column in DF1 that takes a row [i] in DF1$messy, and searches for a match in DF2$old, and populates the field with the exact value of DF2$new that correlates to that DF2$old value. I'm not sure how to do this, as using the [i] suffix to a row always returns that same row number... For example, how would I take a a row - say, DF1$messy[142] - and create a fuzzy match with, say DF2$old[15], and create a new column that returns DF2$new[15] ???
Upvotes: 1
Views: 100
Reputation: 1114
You can collapse your search terms and match them with the 'messy' terms in the original data. Then use the matches to join the data frames.
df1 <- data.frame(messy=c('abc','def','ghi'),stringsAsFactors = F)
df2 <- data.frame(old=c('ab','ef','hi'),new=c(1:3),
stringsAsFactors = F)
df2$messy <- grep(paste(df2$old,collapse = '|'),df1$messy,value = T)
left_join(df2,df1)
Joining, by = "messy"
old new messy
1 ab 1 abc
2 ef 2 def
3 hi 3 ghi
Upvotes: 2