user1471980
user1471980

Reputation: 10646

how do you merge two data frames based on common text on one data frame which is part of another data frame column value

I have one huge data frame and one small data frame.

big data frame called df1 something like this:

Hostname     Region    Model
ServerABC101  US     Cisco
ServerABC102  US     Cisco
ServerDDC103  PAC    Intel
ServerDDC609  Emea   Intel
ServerDDC103  PAC    Intel
ServerDDC609  Emea   Intel

Small data frame df2 is like this:

Site  City State
ABC   NYC  NY
DDC   DAL  TX

I need to merge these two data frames base on text in df2['Site'] column matches df1['Hostname']

final data frame needs to be like this:

Hostname     Region    Model Site  City State
ServerABC101  US     Cisco.  ABC   NYC  NY
ServerABC102  US     Cisco.  ABC   NYC  NY
ServerDDC103  PAC    Intel   DDC   DAL  TX
ServerDDC609  Emea   Intel   DDC   DAL  TX
ServerDDC103  PAC    Intel   DDC   DAL  TX
ServerDDC609  Emea   Intel   DDC   DAL  TX

I am familiar with pd merge but the Site from df2 is only partial text from Hostname in df1.

final=reduce(lambda s, y: pd.merge(x, y, on="Hostname", how=outer, [df1, df2])

Any ideas how I could do this in pandas?

Upvotes: 0

Views: 106

Answers (3)

user19077881
user19077881

Reputation: 5470

You can use Regex to extract the Site code, form a new column using that and then merge. Assuming the site code is 3 upper-case letters following Server:

df1['Site'] = df1['Hostname'].str.extract(r"Server([A-Z]{3})")

df1 = pd.merge(df1, df2, on = 'Site', how = 'left')

print(df1)

gives:

       Hostname Region  Model Site City State
0  ServerABC101     US  Cisco  ABC  NYC    NY
1  ServerABC102     US  Cisco  ABC  NYC    NY
2  ServerDDC103    PAC  Intel  DDC  DAL    TX
3  ServerDDC609   Emea  Intel  DDC  DAL    TX
4  ServerDDC103    PAC  Intel  DDC  DAL    TX
5  ServerDDC609   Emea  Intel  DDC  DAL    TX

Upvotes: 1

Panda Kim
Panda Kim

Reputation: 13257

Code

use left_on & right_on of merge

df1.merge(df2, how='left', left_on=df1['Hostname'].str.extract(r"Server([A-Z]+)(?=\d+$)")[0], right_on='Site')

The advantage of using left_on and right_on is that it does not inplace the original


Update

I am updating my answer because your reply

pat = '({})'.format('|'.join(df2['Site']))
df1.merge(df2, how='left', left_on=df1['Hostname'].str.extract(pat)[0], right_on='Site')

output:

    Hostname        Region  Model   Site    City    State
0   ServerABC101    US      Cisco   ABC     NYC     NY
2   ServerABC102    US      Cisco   ABC     NYC     NY
5   ServerDDC103    PAC     Intel   DDC     DAL     TX
7   ServerDDC609    Emea    Intel   DDC     DAL     TX
9   ServerDDC103    PAC     Intel   DDC     DAL     TX
11  ServerDDC609    Emea    Intel   DDC     DAL     TX

Upvotes: 0

Hooded 0ne
Hooded 0ne

Reputation: 997

First create a column called sever in your first dataset with some code like this. Then join as you normally would. You can then later drop the sever column if you want.

import pandas as pd

# Create a sample DataFrame
data = {'hostname': ['ServerABC101', 'ServerXYZ202', 'ServerDEF303']}
df = pd.DataFrame(data)

# Split the "hostname" column
df[['Server', 'a', '101']] = df['hostname'].str.split('(\d+)', expand=True)
df['Server'] = df['Server'].str.replace('Server', '')
df = df.drop('101', axis=1) 
df = df.drop('a', axis=1) 


# Display the updated DataFrame
print(df)



   #   hostname        Server
   # 0  ServerABC101    ABC
   # 1  ServerXYZ202    XYZ
   # 2  ServerDEF303    DEF

Upvotes: 0

Related Questions