GreenGodot
GreenGodot

Reputation: 6753

Changing the value in a dataframe column depending on the value of two columns in a diffrent dataframe

I have two dataframes made with Pandas in python:

df1

id    business   state   inBusiness
1     painter     AL        no
2     insurance   AL        no
3     lawyer      OH        no
4     dentist     NY        yes
...........

df2  

id    business    state
1     painter       NY
2     painter       AL
3     builder       TX
4     painter       AL    
......

Basically, I want to set the 'inBusiness' value in df1 to 'yes' if an instance of the exact same business/location combo exists in df2.

So for example, if painter/AL exists in df2, than all instances of painter/AL in df1 have their 'inBusiness' value set to yes.

The best I can come up with right now is this:

for index, row in df2.iterrows():
    df1[ (df1.business==str(row['business'])) & (df1.state==str(row['state']))]['inBusiness'] = 'Yes'

but the first dataframe can potentially have hundreds of thousands of rows to loop through for each row in the second dataframe so this method is not very reliable. Is there a nice one-liner I can use here that would also be quick?

Upvotes: 0

Views: 69

Answers (2)

Stefan
Stefan

Reputation: 42875

You could use .merge(how='left', indicator=True) (indicator was added in pandas>=0.17, see docs) to identify matching columns as well as the source of the match to get something along these lines:

df1.merge(df2, how='left', indicator=True) # merges by default on shared columns

   id   business state inBusiness     _merge
0   1    painter    AL         no       both
1   2  insurance    AL         no  left_only
2   3     lawyer    OH         no  left_only
3   4    dentist    NY        yes  left_only

The _merge indicates in which cases the (business, state) combinations are available in both df1 and df2. Then you just need to:

df['inBusiness'] = df._merge == 'both'

to get:

   id   business state inBusiness     _merge
0   1    painter    AL       True       both
1   2  insurance    AL      False  left_only
2   3     lawyer    OH      False  left_only
3   4    dentist    NY      False  left_only

Upvotes: 2

michael_j_ward
michael_j_ward

Reputation: 4559

Probably most efficient to create a map

inBusiness = {(business,state): 'yes' 
              for business,state in zip(df2['business'],df2['state'])}
df1['inBusiness'] = [ inBusiness.get((business,state),"no") 
                     for business,state in zip(df1['business'],df1['state'])]
df1

OUTPUTS

    id  business    state   inBusiness
0   1   painter     AL  yes
1   2   insurance   AL  no
2   3   lawyer      OH  no
3   4   dentist     NY  no

Explanation Edit:

You were vague about "explaining further," so I'll give a high level of everything

The built-in zip function that takes two iterables (like two lists, or two series) and "zips" them together into tuples.

a = [1,2,3]
b = ['a','b','c']
for tup in zip(a,b): print(tup)

outputs:

(1, 'a')
(2, 'b')
(3, 'c')

Additionally, tuples in python can be "unpacked" into individual variables

tup = (3,4)
x,y = tup
print(x)
print(y)

You can combine these two things to create dictionary comprehensions

newDict = {k: v for k,v in zip(a,b)}
newDict

Outputs:

{1: 'a', 2: 'b', 3: 'c'}

inBusiness is a python dictionary created using a dictionary comprehension after zipping together the series of df2['business'] and df2['state'].

I did not actually need to unpack the variables, but I did so for what I thought would be clarity.

Note that this map is only half of what you're hopping to do because every key (business,state) in the dictionary maps to yes. Thankfully, dict.get let's us specify a default value to return if the key is not found- which in your case is "no"

Then, the desired column is created using a list-comprehension to achieve your desired result.

Does that cover everything?

Upvotes: 1

Related Questions