Snobby

Reputation: 1155

Excel VLOOKUP analog in pandas (Python)

I have two DataFrames.

df1:

index ID City         Region 2City
1     23 Moscow       Msk    
2     34 Obninsk      Msk    Msk
3     56                     Spb
4     17 Tula         Spb

df2:

index City   Office
1     Msk    Msk
2     Spb    Spb
3     Tula   Msk
4     Moscow Msk

I would like to get the following df:

index ID City         Region 2City Office
1     23 Moscow       Msk          Msk
2     34 Obninsk      Msk    Msk   Msk
3     56                     Spb   Spb
4     17 Tula         Spb          Msk

So it looks up which 'Office' from df2 matches 'City', 'Region', or '2City' from df1.

Also, if I find an 'Office' for 'City', I stop searching. So the 'City' column has priority, then comes 'Region', then comes '2City'.

I know how to do this with three for loops, but I hope there is a better option.
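For reference, here is a minimal sketch of the loop-based version (assuming df1 and df2 are pandas DataFrames and missing cells are stored as empty strings, as in the tables above):

city_to_office = dict(zip(df2['City'], df2['Office']))  # city name -> office

offices = []
for _, row in df1.iterrows():
    office = None
    # City has priority, then Region, then 2City; stop at the first value
    # that is known in df2
    for col in ['City', 'Region', '2City']:
        if row[col] in city_to_office:
            office = city_to_office[row[col]]
            break
    offices.append(office)

df1['Office'] = offices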

Upvotes: 1

Views: 4932

Answers (4)

Alexander

Reputation: 109546

One generally does a VLOOKUP in Pandas using join (which defaults to a left join) or merge (where one must specify how='left').
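For a single lookup column, a minimal sketch of that merge looks like:

# Left join: keep every row of df1 and pull in df2's Office where City matches
df1.merge(df2, on='City', how='left')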

In your case, you are trying to look up a value based on three columns. One way to do this is to use or to take the first non-null value among them.

>>> (df1
     # build a temporary key from the first non-empty of Region, 2City, City,
     # then left-merge it against df2's City column
     .assign(temp=[region or two_city or city
                   for region, two_city, city in zip(df1.Region, df1['2City'], df1.City)])
     .merge(df2, how='left', right_on='City', left_on='temp', suffixes=['', '_'])
     .drop(['temp', 'City_'], axis=1))

   ID     City Region 2City Office
0  23   Moscow    Msk  None    Msk
1  34  Obninsk    Msk   Msk    Msk
2  56     None   None   Spb    Spb
3  17     Tula    Spb  None    Msk

Upvotes: 1

PaSTE

Reputation: 4548

import pandas as pd

df1 = pd.DataFrame([[23, 'Moscow', 'Msk', ''],
                    [34, 'Obninsk', 'Msk', 'Msk'],
                    [56, '', '', 'Spb'],
                    [17, 'Tula', 'Spb', '']],
                   columns=['ID', 'City', 'Region', '2City'])
df2 = pd.DataFrame([['Msk', 'Msk'],
                    ['Spb', 'Spb'],
                    ['Tula', 'Msk'],
                    ['Moscow', 'Msk']],
                   columns=['City', 'Office'])

# For each of the three columns, keep only the values that also appear in
# df2['City'], then take the first hit per row (City has priority, then
# Region, then 2City, because of the concatenation order).
df = pd.concat([df1.loc[df1[x].isin(df2['City']), x] for x in ['City', 'Region', '2City']])
df1['Join'] = df.groupby(df.index).first()
output = df1.merge(df2, left_on='Join', right_on='City', how='left')

At this point, the pandas.DataFrame called "output" will contain a column called "Office" indexed appropriately to match with df1. You could do something like:

df1['Office'] = output['Office']

And this will give you the DataFrame you requested plus the additional column "Join", which you can drop via:

df1.drop('Join', axis=1, inplace=True)

The OP here basically wants to do three separate merges, one on each of the columns "City", "Region", and "2City" from df1. It is therefore not easy to do this using basic pandas.DataFrame operations without some gymnastics. I just hid the for loop in the list comprehension, but it is still there.
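For illustration, the same "three lookups, then take the first hit" idea can also be sketched without an explicit merge, reusing the df1 and df2 built above:

# Map each of the three columns through a city -> office lookup (three
# separate lookups), then keep the first non-null result per row,
# scanning City, Region, 2City in that order
lookup = df2.set_index('City')['Office']
mapped = df1[['City', 'Region', '2City']].apply(lambda col: col.map(lookup))
df1['Office'] = mapped.bfill(axis=1).iloc[:, 0]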

Upvotes: 0

IanS

Reputation: 16241

Without a for loop:

cols = ['City', 'Region', '2City']
lookup = df2.set_index('City')['Office']  # maps city name -> office

# Look up every cell, then keep the first non-null result per row
# (the column order gives City priority, then Region, then 2City)
df1[cols].applymap(lambda x: lookup.get(x)) \
         .apply(lambda row: row[row.first_valid_index()], axis=1)

I set City as the index for df2 in order to look up values with the get method. After looking up the values element-wise with applymap, I take the first valid answer per row with first_valid_index.
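If you want to keep the result, a small usage sketch:

df1['Office'] = df1[cols].applymap(lambda x: lookup.get(x)) \
                         .apply(lambda row: row[row.first_valid_index()], axis=1)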

Upvotes: 0

RafG

Reputation: 1506

You can merge the two DataFrames with pd.merge, but as I understand it, you actually want to merge on different columns of df1. A possible way to do this is to add an extra column that takes the value of 'City' if available (otherwise 'Region' or '2City').

import pandas as pd

# Take City when present, otherwise fall back to Region, then 2City
df1['Lookup'] = df1.City.fillna(df1.Region).fillna(df1['2City'])
# Match that helper column against df2's City and pull in the Office column
df = pd.merge(df1, df2, left_on='Lookup', right_on='City',
              how='left', suffixes=('', '_df2'))

You don't specify whether you have missing values in either DataFrame. If you do, you can control how rows without a match are handled with the how argument to pd.merge.

Upvotes: 0
