Reputation: 1155
I have two DataFrames.
df1:
index  ID  City     Region  2City
1      23  Moscow   Msk
2      34  Obninsk  Msk     Msk
3      56                   Spb
4      17  Tula     Spb
df2:
index  City    Office
1      Msk     Msk
2      Spb     Spb
3      Tula    Msk
4      Moscow  Msk
I would like to get the following df:
index  ID  City     Region  2City  Office
1      23  Moscow   Msk            Msk
2      34  Obninsk  Msk     Msk    Msk
3      56                   Spb    Spb
4      17  Tula     Spb            Msk
So it checks which 'Office' from df2 matches 'City', 'Region', or '2City' from df1. Also, as soon as an 'Office' is found for 'City', the search stops. So the 'City' column has priority, then comes 'Region', then comes '2City'.
I know how to do this with three for loops (roughly the sketch below), but I hope there is a better option.
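Something like this, just to show what I mean (a rough sketch, assuming the blank cells are empty strings):
offices = []
for _, row in df1.iterrows():
    office = None
    for col in ['City', 'Region', '2City']:    # priority order
        for _, ref in df2.iterrows():          # scan df2 for a matching city
            if row[col] and row[col] == ref['City']:
                office = ref['Office']
                break
        if office is not None:
            break
    offices.append(office)
df1['Office'] = offices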
Upvotes: 1
Views: 4932
Reputation: 109546
One generally does a VLOOKUP in pandas using join (which defaults to a left join) or merge (where one must specify how='left').
In your case, you are trying to look up a value based on three columns. One way to do this is to use or to take the first non-null value.
>>> (df1
     .assign(temp=[region or two_city or city
                   for region, two_city, city in zip(df1.Region, df1['2City'], df1.City)])
     .merge(df2, how='left', right_on='City', left_on='temp', suffixes=['', '_'])
     .drop(['temp', 'City_'], axis=1))
   ID     City Region 2City Office
0  23   Moscow    Msk  None    Msk
1  34  Obninsk    Msk   Msk    Msk
2  56     None   None   Spb    Spb
3  17     Tula    Spb  None    Msk
Upvotes: 1
Reputation: 4548
import pandas as pd

df1 = pd.DataFrame([[23, 'Moscow', 'Msk', ''],
                    [34, 'Obninsk', 'Msk', 'Msk'],
                    [56, '', '', 'Spb'],
                    [17, 'Tula', 'Spb', '']],
                   columns=['ID', 'City', 'Region', '2City'])

df2 = pd.DataFrame([['Msk', 'Msk'],
                    ['Spb', 'Spb'],
                    ['Tula', 'Msk'],
                    ['Moscow', 'Msk']],
                   columns=['City', 'Office'])
# For each of the three columns, keep the values that appear in df2['City'],
# then take the first hit per row in priority order (City, Region, 2City)
df = pd.concat([df1.loc[df1[x].isin(df2['City']), x] for x in ['City', 'Region', '2City']])
df1['Join'] = df.groupby(df.index).first()
output = df1.merge(df2, left_on='Join', right_on='City', how='left')
At this point, the pandas.DataFrame called "output" will contain a column called "Office" indexed to match df1. You could do something like:
df1['Office'] = output['Office']
And this will give you the DataFrame you requested plus the additional column "Join", which you can drop via:
df1.drop('Join', axis=1, inplace=True)
The OP here basically wants to do three separate merges, one on each of the columns "City", "Region", and "2City" from df1. It is therefore not easy to do this using basic pandas.DataFrame operations without some gymnastics. I just hid the for loop in the list comprehension, but it's still there.
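For comparison, here is a rough sketch of the same three-lookups idea written with Series.map and combine_first instead of merge (column names as in the question; just an illustration, not necessarily better):
lookup = df2.set_index('City')['Office']                        # city -> office mapping
df1['Office'] = (df1['City'].map(lookup)                        # try City first
                    .combine_first(df1['Region'].map(lookup))   # then Region
                    .combine_first(df1['2City'].map(lookup)))   # then 2City
Values that are empty or not found in df2 map to NaN, so combine_first falls through to the next column in priority order.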
Upvotes: 0
Reputation: 16241
Without a for loop:
cols = ['City', 'Region', '2City']
df1[cols].applymap(lambda x: df2.set_index('City')['Office'].get(x)) \
         .apply(lambda row: row[row.first_valid_index()], axis=1)
I set City as the index of df2 in order to look up values with the get method. After looking up the values element-wise with applymap, I find the first valid answer per row with first_valid_index.
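To attach the result as a column, a possible sketch (it builds the lookup Series once instead of re-indexing df2 per element, and assumes every row has at least one match, since first_valid_index returns None for all-empty rows):
lookup = df2.set_index('City')['Office']
cols = ['City', 'Region', '2City']
df1['Office'] = (df1[cols].applymap(lookup.get)
                    .apply(lambda row: row[row.first_valid_index()], axis=1))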
Upvotes: 0
Reputation: 1506
You can merge the two DataFrames with pd.merge, but as I understand it, you actually want to merge on different columns of df1. A possible way to do this is to add an extra column that takes the value of 'City' if available (otherwise 'Region' or '2City').
import pandas as pd
# Merge key: City if available, otherwise Region, otherwise 2City (fillna expects NaN, not '')
df1['Lookup'] = df1['City'].fillna(df1['Region']).fillna(df1['2City'])
df = (df1.merge(df2, how='left', left_on='Lookup', right_on='City', suffixes=('', '_df2'))
         .drop(['Lookup', 'City_df2'], axis=1))
You don't specify whether you have missing values in either DataFrame. If so, you can control how unmatched rows are handled with the how argument to pd.merge.
Upvotes: 0