SwagZ
SwagZ

Reputation: 817

pandas dataframe merge based on str.contains

I have two dataframe that I would like to merge based on if column value from df2 contains column value from df1. I've been trying to use str.contains and series.isin. But no luck so far. Example below.

df1

       Domain              Visits
         aaa                  1 
         bbb                  3
         ddd                  5

df2

       Domain                Ads
         aaa.com              2 
         bbb                  4
         c.com                3
         e.net                6

Merged Dataframe should be like this

mergeDF

       Domain              Visits       Ads
         aaa.com              1          2
         bbb                  3          4
         c.com                           3
         ddd                  5          
         e.net                           6

Thx in advance.

Upvotes: 4

Views: 5334

Answers (2)

user3483203
user3483203

Reputation: 51155

Setup

We can start by creating a new series which is the row each domain in df2 corresponds to in df1, if such a row exists:

r = '({})'.format('|'.join(df1.Domain))
merge_df = df2.Domain.str.extract(r, expand=False).fillna(df2.Domain)

Our merge_df now looks like this:

0      aaa
1      bbb
2    c.com
3    e.net
Name: Domain, dtype: object

Now we merge. We want an outer merge here, merging our first DataFrame on the Domain column, and our second DataFrame on our merge_df series:

df1.merge(df2.drop('Domain', 1), left_on='Domain', right_on=merge_df, how='outer')

 Domain  Visits  Ads
0    aaa     1.0  2.0
1    bbb     3.0  4.0
2    ddd     5.0  NaN
3  c.com     NaN  3.0
4  e.net     NaN  6.0

Since we stored a separate series to merge on, and didn't modify df2, we are done after this step, and you have your merged DataFrame.

Upvotes: 4

Scott Boston
Scott Boston

Reputation: 153460

Try and preprocess df2, to get domain in the same format as df1:

df2['key'] = df2['Domain'].str.extract('(\w+)\.?')

df_out = df1.merge(df2, left_on='Domain', right_on='key', suffixes=('_x',''), how='outer')

df_out['Domain'] = df_out['Domain'].combine_first(df_out['Domain_x'])

df_out = df_out.drop(['Domain_x','key'], axis=1)

print(df_out)

Output:

   Visits   Domain  Ads
0     1.0  aaa.com  2.0
1     3.0      bbb  4.0
2     5.0      ddd  NaN
3     NaN    c.com  3.0
4     NaN    e.net  6.0

Upvotes: 1

Related Questions