Reputation: 141
I am trying to merge two fairly large dataframes of different sizes based on partial string matches.
df1$code contains all 12 digit codes, while df2$code contains a mix of codes with 10-12 digits, where some of the shorter codes are substring matches to the 12 digit codes in df1$code.
Therefore, I need to merge all 12 digit matches between the two dataframes, but also those records in df2 that have 10-11 digit codes that are substring matches to the df1.
Example dataframes:
df1 <- data.frame(code_1 = c('123456789012', '210987654321', '567890543211', '987656789001', '123456654321', '678905432156', '768927461037', '780125634701', '673940175372', '167438501473'),
name = c('bob','joe','sally','john','lucy','alan', 'fred','stephanie','greg','tom'))
df2 <- data.frame(code_2 = c('123456789012','2109876543','7890543211','98765678900','12345665432','678905432156'),
color = c('blue', 'red', 'green', 'purple', 'orange', 'brown'))
df3 (merged)
code_1 code_2 name color
123456789012 123456789012 bob blue
210987654321 2109876543 joe red
567890543211 7890543211 sally green
987656789001 98765678900 john purple
123456654321 12345665432 lucy orange
678905432156 678905432156 alan brown
Upvotes: 3
Views: 4623
Reputation: 269606
Try this SQL join.
library(sqldf)
sqldf("select a.code_1, b.code_2, a.name, b.color
from df2 b left join df1 a on a.code_1 like '%' || b.code_2 || '%'")
giving:
code_1 code_2 name color
1 123456789012 123456789012 bob blue
2 210987654321 2109876543 joe red
3 567890543211 7890543211 sally green
4 987656789001 98765678900 john purple
5 123456654321 12345665432 lucy orange
6 678905432156 678905432156 alan brown
Update: Updated answer to reflect change in question so that (1) the substring can be anywhere in the target string and (2) names of code columns have changed to code_1
and code_2
.
Upvotes: 3
Reputation: 68146
In python/pandas, you can do:
from pandas import DataFrame, Series
df1 = DataFrame(dict(
code1 = ('123456789012', '210987654321', '567890543211', '987656789001', '123456654321', '678905432156', '768927461037', '780125634701', '673940175372', '167438501473'),
name = ('bob','joe','sally','john','lucy','alan', 'fred','stephanie','greg','tom')))
df2 = DataFrame(dict(
code2 = ('123456789012','2109876543','7890543211','98765678900','12345665432','678905432156'),
color = ('blue', 'red', 'green', 'purple', 'orange', 'brown')))
matches = [df1[df1['code1'].str.contains(x)].index[0] for x in df2['code2']]
print(
df1.assign(subcode=Series(data=df2['code2'], index=matches))
.merge(df2, left_on='subcode', right_on='code2')
.drop('subcode', axis='columns')
)
And that dumps:
code1 name code2 color
0 123456789012 bob 123456789012 blue
1 210987654321 joe 2109876543 red
2 567890543211 sally 7890543211 green
3 987656789001 john 98765678900 purple
4 123456654321 lucy 12345665432 orange
5 678905432156 alan 678905432156 brown
Note: I hate using loops with dataframes, but this, uh, works, I guess.
Upvotes: 1
Reputation: 7312
Updated per new info. This should work:
df2$New <- lapply(df2$code_2, grep, df1$code_1,value=T)
combined <- merge(df1,df2, by.x="code_1", by.y="New")
code_1 name code_2 color
1 123456654321 lucy 12345665432 orange
2 123456789012 bob 123456789012 blue
3 210987654321 joe 2109876543 red
4 567890543211 sally 7890543211 green
5 678905432156 alan 678905432156 brown
6 987656789001 john 98765678900 purple
Upvotes: 1
Reputation: 18681
We can use grep
+ sapply
to extract indices of matches from df2$code
for each df1$code
and create a matchID
out of it. Next, we merge
on matchID
to get desired output:
df1$matchID = row.names(df1)
df2$matchID = sapply(df2$code, function(x) grep(x, df1$code))
df_merge = merge(df1, df2, by = "matchID")[-1]
Note that if a df1$code
does not match any df2$code
, df2$matchID
will be blank, and so would not merge with df1$matchID
.
Results:
> df2
code color matchID
1 123456789012 blue 1
2 2109876543 red 2
3 7890543211 green 3
4 98765678900 purple 4
5 12345665432 orange 5
6 678905432156 brown 6
7 14124124124 black
> df_merge
code.x name code.y color
1 123456789012 bob 123456789012 blue
2 210987654321 joe 2109876543 red
3 567890543211 sally 7890543211 green
4 987656789001 john 98765678900 purple
5 123456654321 lucy 12345665432 orange
6 678905432156 alan 678905432156 brown
Data (Added non-match for better demo):
df1 <- data.frame(code = c('123456789012', '210987654321', '567890543211', '987656789001', '123456654321', '678905432156', '768927461037', '780125634701', '673940175372', '167438501473'),
name = c('bob','joe','sally','john','lucy','alan', 'fred','stephanie','greg','tom'),
stringsAsFactors = FALSE)
df2 <- data.frame(code = c('123456789012','2109876543','7890543211','98765678900','12345665432','678905432156', '14124124124'),
color = c('blue', 'red', 'green', 'purple', 'orange', 'brown', 'black'),
stringsAsFactors = FALSE)
Upvotes: 1