Reputation: 10646
I have one huge data frame and one small data frame.
big data frame called df1 something like this:
Hostname Region Model
ServerABC101 US Cisco
ServerABC102 US Cisco
ServerDDC103 PAC Intel
ServerDDC609 Emea Intel
ServerDDC103 PAC Intel
ServerDDC609 Emea Intel
Small data frame df2 is like this:
Site City State
ABC NYC NY
DDC DAL TX
I need to merge these two data frames base on text in df2['Site'] column matches df1['Hostname']
final data frame needs to be like this:
Hostname Region Model Site City State
ServerABC101 US Cisco. ABC NYC NY
ServerABC102 US Cisco. ABC NYC NY
ServerDDC103 PAC Intel DDC DAL TX
ServerDDC609 Emea Intel DDC DAL TX
ServerDDC103 PAC Intel DDC DAL TX
ServerDDC609 Emea Intel DDC DAL TX
I am familiar with pd merge but the Site from df2 is only partial text from Hostname in df1.
final=reduce(lambda s, y: pd.merge(x, y, on="Hostname", how=outer, [df1, df2])
Any ideas how I could do this in pandas?
Upvotes: 0
Views: 106
Reputation: 5470
You can use Regex to extract the Site code, form a new column using that and then merge. Assuming the site code is 3 upper-case letters following Server
:
df1['Site'] = df1['Hostname'].str.extract(r"Server([A-Z]{3})")
df1 = pd.merge(df1, df2, on = 'Site', how = 'left')
print(df1)
gives:
Hostname Region Model Site City State
0 ServerABC101 US Cisco ABC NYC NY
1 ServerABC102 US Cisco ABC NYC NY
2 ServerDDC103 PAC Intel DDC DAL TX
3 ServerDDC609 Emea Intel DDC DAL TX
4 ServerDDC103 PAC Intel DDC DAL TX
5 ServerDDC609 Emea Intel DDC DAL TX
Upvotes: 1
Reputation: 13257
Code
use left_on & right_on of merge
df1.merge(df2, how='left', left_on=df1['Hostname'].str.extract(r"Server([A-Z]+)(?=\d+$)")[0], right_on='Site')
The advantage of using left_on and right_on is that it does not inplace the original
Update
I am updating my answer because your reply
pat = '({})'.format('|'.join(df2['Site']))
df1.merge(df2, how='left', left_on=df1['Hostname'].str.extract(pat)[0], right_on='Site')
output:
Hostname Region Model Site City State
0 ServerABC101 US Cisco ABC NYC NY
2 ServerABC102 US Cisco ABC NYC NY
5 ServerDDC103 PAC Intel DDC DAL TX
7 ServerDDC609 Emea Intel DDC DAL TX
9 ServerDDC103 PAC Intel DDC DAL TX
11 ServerDDC609 Emea Intel DDC DAL TX
Upvotes: 0
Reputation: 997
First create a column called sever in your first dataset with some code like this. Then join as you normally would. You can then later drop the sever column if you want.
import pandas as pd
# Create a sample DataFrame
data = {'hostname': ['ServerABC101', 'ServerXYZ202', 'ServerDEF303']}
df = pd.DataFrame(data)
# Split the "hostname" column
df[['Server', 'a', '101']] = df['hostname'].str.split('(\d+)', expand=True)
df['Server'] = df['Server'].str.replace('Server', '')
df = df.drop('101', axis=1)
df = df.drop('a', axis=1)
# Display the updated DataFrame
print(df)
# hostname Server
# 0 ServerABC101 ABC
# 1 ServerXYZ202 XYZ
# 2 ServerDEF303 DEF
Upvotes: 0