Nathan-Luke Anderson
Nathan-Luke Anderson

Reputation: 31

Replacing part of a string using a lookup table

I have two data frames:

DF1: One string column of 200 rows

DF2: Two columns of 130 rows:

Looks like this

DF1: 
-----------------
|   messy   |

| abc.'123_c |

| def.'456_c |

| hij.'789_c |


DF2:
-----------------
|   old_str      |    new_str    |

|    123         |      aa       |

|    789         |      cc       |

I need a way to bring in the field DF2$new_str into DF1, by matching the field DF1$messy to the substring DF2$old_str

I first tried writing a for loop, expecting that the the function would apply over the entirety of the columns:

    df1$new.str <- 0
    correct_field <-
     for (i in 1:nrow(df1)){ 
      IF (df1$messy[i] == df2$old_str)
      df1$clean[i] = df2$new_str
    }

Second I tried the difference_left_join function:

    library(dplyr)
    library(fuzzyjoin)
    test<-difference_left_join(df1, df2, by = c(cut = 'Data.Lake.Field' )

(I renamed the join column in both sets 'Data.Lake.Field')

Desired Output:

-----------------
|   messy      |    new_str    |

|    abc.'123_c         |      aa       |

|    def.'456_c         |     NULL      |

|    hij.'789_c         |      cc       |

EDIT: I think, theoretically, the best was to tackle this would be a FOR loop, with a new column in DF1 that takes a row [i] in DF1$messy, and searches for a match in DF2$old, and populates the field with the exact value of DF2$new that correlates to that DF2$old value. I'm not sure how to do this, as using the [i] suffix to a row always returns that same row number... For example, how would I take a a row - say, DF1$messy[142] - and create a fuzzy match with, say DF2$old[15], and create a new column that returns DF2$new[15] ???

Upvotes: 1

Views: 100

Answers (1)

kstew
kstew

Reputation: 1114

You can collapse your search terms and match them with the 'messy' terms in the original data. Then use the matches to join the data frames.

df1 <- data.frame(messy=c('abc','def','ghi'),stringsAsFactors = F)
df2 <- data.frame(old=c('ab','ef','hi'),new=c(1:3),
                  stringsAsFactors = F)

df2$messy <- grep(paste(df2$old,collapse = '|'),df1$messy,value = T)

left_join(df2,df1)

Joining, by = "messy"
  old new messy
1  ab   1   abc
2  ef   2   def
3  hi   3   ghi

Upvotes: 2

Related Questions