Jason S
Jason S

Reputation: 189826

lookup first match in Pandas dataframe

I need to use a DataFrame as a lookup table on columns that are not part of the index. For example (this is a simple one just to illustrate):

import pandas as pd

westcoast = pd.DataFrame([['Washington','Olympia'],['Oregon','Salem'],
                          ['California','Sacramento']],
                        columns=['state','capital'])
print westcoast

        state     capital
0  Washington     Olympia
1      Oregon       Salem
2  California  Sacramento

It's easy to lookup and get a Series as an output:

westcoast[westcoast.state=='Oregon'].capital

1    Salem
Name: capital, dtype: object

but I want to obtain the string 'Salem':

westcoast[westcoast.state=='Oregon'].capital.values[0]

'Salem'

and the .values[0] seems somewhat clunky... is there a better way?

(FWIW: my real data has maybe 50 rows at most, but lots of columns, so if I do set an index column, no matter what column I choose, there will be a lookup operation like this that is not based on an index, and the relatively small number of rows means that I don't care if it's O(n) lookup.)

Upvotes: 5

Views: 9806

Answers (2)

jezrael
jezrael

Reputation: 863281

Yes, you can use Series.item if the lookup will always returns one element from the Series:

westcoast.loc[westcoast.state=='Oregon', 'capital'].item()

Exceptions can be handled if the lookup returns nothing, or one or more values and you need only the first item:

s = westcoast.loc[westcoast.state=='Oregon', 'capital']
s = np.nan if s.empty else s.iat[0] 
print (s) #Salem

s = westcoast.loc[westcoast.state=='New York', 'capital']
s = np.nan if s.empty else s.iat[0] 
print (s)
nan

A more general solution to handle the exceptions because there are 3 possible output scenarios:

westcoast = pd.DataFrame([['Washington','Olympia'],['Oregon','Salem'],
                          ['California','Sacramento'],['Oregon','Portland']],
                        columns=['state','capital'])

print (westcoast)
        state     capital
0  Washington     Olympia
1      Oregon       Salem
2  California  Sacramento
3      Oregon    Portland

s = westcoast.loc[westcoast.state=='Oregon', 'capital']

#if not value returned
if s.empty:
    s = 'no match'

#if only one value returned
elif len(s) == 1:
    s = s.item()
else:

# if multiple values returned, return a list of values
    s = s.tolist()

print (s)
['Salem', 'Portland']

It is possible to create a lookup function:

def look_up(a):
    s = westcoast.loc[westcoast.state==a, 'capital']
    #for no match
    if s.empty:
        return np.nan
    #for match only one value
    elif len(s) == 1:
        return s.item()
    else:
    #for return multiple values
        return s.tolist()

print (look_up('Oregon'))
['Salem', 'Portland']

print (look_up('California'))
Sacramento

print (look_up('New Yourk'))
nan

Upvotes: 6

unutbu
unutbu

Reputation: 880429

If you are going to do frequent lookups of this sort, then it pays to make state the index:

state_capitals = westcoast.set_index('state')['capital']
print(state_capitals['Oregon'])
# Salem

With an index, each lookup is O(1) on average, whereas westcoast['state']=='Oregon' requires O(n) comparisons. Of course, building the index is also O(n), so you would need to do many lookups for this to pay off.

At the same time, once you have state_capitals the syntax is simple and dict-like. That might be reason enough to build state_capitals.

Upvotes: 0

Related Questions