vethno
vethno

Reputation: 141

Merge two dataframes by partial string match

I am trying to merge two fairly large dataframes of different sizes based on partial string matches.

df1$code contains all 12 digit codes, while df2$code contains a mix of codes with 10-12 digits, where some of the shorter codes are substring matches to the 12 digit codes in df1$code.

Therefore, I need to merge all 12 digit matches between the two dataframes, but also those records in df2 that have 10-11 digit codes that are substring matches to the df1.

Example dataframes:

df1 <- data.frame(code_1 = c('123456789012', '210987654321', '567890543211', '987656789001', '123456654321', '678905432156', '768927461037', '780125634701', '673940175372', '167438501473'),
              name = c('bob','joe','sally','john','lucy','alan', 'fred','stephanie','greg','tom'))

df2 <- data.frame(code_2 = c('123456789012','2109876543','7890543211','98765678900','12345665432','678905432156'),
              color = c('blue', 'red', 'green', 'purple', 'orange', 'brown'))

df3 (merged)

code_1         code_2         name  color
123456789012   123456789012   bob   blue
210987654321   2109876543     joe   red
567890543211   7890543211     sally green
987656789001   98765678900    john  purple
123456654321   12345665432    lucy  orange
678905432156   678905432156   alan  brown

Upvotes: 3

Views: 4623

Answers (4)

G. Grothendieck
G. Grothendieck

Reputation: 269606

Try this SQL join.

library(sqldf)

sqldf("select a.code_1, b.code_2, a.name, b.color 
       from df2 b left join df1 a on a.code_1 like '%' || b.code_2 || '%'")

giving:

        code_1       code_2  name  color
1 123456789012 123456789012   bob   blue
2 210987654321   2109876543   joe    red
3 567890543211   7890543211 sally  green
4 987656789001  98765678900  john purple
5 123456654321  12345665432  lucy orange
6 678905432156 678905432156  alan  brown

Update: Updated answer to reflect change in question so that (1) the substring can be anywhere in the target string and (2) names of code columns have changed to code_1 and code_2.

Upvotes: 3

Paul H
Paul H

Reputation: 68146

In python/pandas, you can do:

from pandas import DataFrame, Series
df1 = DataFrame(dict(
        code1 = ('123456789012', '210987654321', '567890543211', '987656789001', '123456654321', '678905432156', '768927461037', '780125634701', '673940175372', '167438501473'),
        name = ('bob','joe','sally','john','lucy','alan', 'fred','stephanie','greg','tom')))

df2 = DataFrame(dict(
        code2 = ('123456789012','2109876543','7890543211','98765678900','12345665432','678905432156'),
        color = ('blue', 'red', 'green', 'purple', 'orange', 'brown')))

matches = [df1[df1['code1'].str.contains(x)].index[0] for x in df2['code2']]

print(
    df1.assign(subcode=Series(data=df2['code2'], index=matches))
       .merge(df2, left_on='subcode', right_on='code2')
       .drop('subcode', axis='columns')
)

And that dumps:

          code1   name         code2   color
0  123456789012    bob  123456789012    blue
1  210987654321    joe    2109876543     red
2  567890543211  sally    7890543211   green
3  987656789001   john   98765678900  purple
4  123456654321   lucy   12345665432  orange
5  678905432156   alan  678905432156   brown

Note: I hate using loops with dataframes, but this, uh, works, I guess.

Upvotes: 1

Mako212
Mako212

Reputation: 7312

Updated per new info. This should work:

df2$New <- lapply(df2$code_2, grep, df1$code_1,value=T)

combined <- merge(df1,df2, by.x="code_1", by.y="New")

        code_1  name       code_2  color
1 123456654321  lucy  12345665432 orange
2 123456789012   bob 123456789012   blue
3 210987654321   joe   2109876543    red
4 567890543211 sally   7890543211  green
5 678905432156  alan 678905432156  brown
6 987656789001  john  98765678900 purple

Upvotes: 1

acylam
acylam

Reputation: 18681

We can use grep + sapply to extract indices of matches from df2$code for each df1$code and create a matchID out of it. Next, we merge on matchID to get desired output:

df1$matchID = row.names(df1)
df2$matchID = sapply(df2$code, function(x) grep(x, df1$code))

df_merge = merge(df1, df2, by = "matchID")[-1]

Note that if a df1$code does not match any df2$code, df2$matchID will be blank, and so would not merge with df1$matchID.

Results:

> df2
          code  color matchID
1 123456789012   blue       1
2   2109876543    red       2
3   7890543211  green       3
4  98765678900 purple       4
5  12345665432 orange       5
6 678905432156  brown       6
7  14124124124  black        

> df_merge
        code.x  name       code.y  color
1 123456789012   bob 123456789012   blue
2 210987654321   joe   2109876543    red
3 567890543211 sally   7890543211  green
4 987656789001  john  98765678900 purple
5 123456654321  lucy  12345665432 orange
6 678905432156  alan 678905432156  brown

Data (Added non-match for better demo):

df1 <- data.frame(code = c('123456789012', '210987654321', '567890543211', '987656789001', '123456654321', '678905432156', '768927461037', '780125634701', '673940175372', '167438501473'),
                  name = c('bob','joe','sally','john','lucy','alan', 'fred','stephanie','greg','tom'),
                  stringsAsFactors = FALSE)

df2 <- data.frame(code = c('123456789012','2109876543','7890543211','98765678900','12345665432','678905432156', '14124124124'),
                  color = c('blue', 'red', 'green', 'purple', 'orange', 'brown', 'black'),
                  stringsAsFactors = FALSE)

Upvotes: 1

Related Questions