Márton Szekeres
Márton Szekeres

Reputation: 55

Finding a specific value for each row in a Dataframe from another Dataframe's column

I am looking for alternate ways to replace functions used in Excel with Python, especially with Pandas. One of the functions is COUNTIFS(), which I have been primarily using to locate specific row values in a fixed range. This is mainly used to determine, whether the specific values in one column are present in the other column, or not.

An example in Excel would look something like this:

enter image description here

The code for the first row (column: col1_in_col2):

=COUNTIFS($B$2:$B$6,A2)

I have tried to recreate the function in Pandas, only with the difference that the two columns can be found in two different DataFrames and the DataFrames are inside a dictionary (bigdict). The code is the following:

import pandas as pd

bigdict = {"df1": pd.DataFrame({"col1": ["0110200_2016", "011037_2016", "011037_2016", "0111054_2016"]}), "df2": pd.DataFrame({"col1" : ["011037_2016", "0111054_2016", "011109_2016", "0111268_2016"]})}

bigdict.get("df1")["df1_in_df2"] = bigdict.get("df1").apply(lambda x: 1 if x["col1"] in bigdict.get("df2")["col1"] else 0, axis=1)

In the example above, the first row should get a return value of zero, while the other rows should get return values of 1, since it can be found in the other DataFrame's column. However, the return value is 0 for every row.

Upvotes: 1

Views: 752

Answers (4)

samredai
samredai

Reputation: 712

Easiest way in my opinion would be to make a generic function that you can apply anytime you want to do the equivalent of an excel countif().

import pandas as pd

def countif(x,col):
    if x in col.values:
        return 1
    else:
        return 0
    return

df1 = pd.DataFrame({"col1": ["0110200_2016", "011037_2016", "011037_2016", "0111054_2016"]})
df2 = pd.DataFrame({"col1" : ["011037_2016", "0111054_2016", "011109_2016", "0111268_2016"]})

df1['df1_in_df2'] = df1['col1'].apply(countif, col=df2['col1'])

EDIT:

As rpanai mentioned in the comments, apply is known to have performance issues as your data grows. Using numpy vectorization would provide a large performance boost. Here is a modified version of Ashwini's answer.

import pandas as pd
import numpy as np

def countif(df1, df2, col1, col2, name):
    df1[name] = np.where(df1[col1].isin(list(df2[col2])),1,0)

df1 = pd.DataFrame({"col1": ["0110200_2016", "011037_2016", "011037_2016", "0111054_2016"]})
df2 = pd.DataFrame({"col1" : ["011037_2016", "0111054_2016", "011109_2016", "0111268_2016"]})

countif(df1,df2,'col1','col1','df1_in_df2')

print(df1)
#            col1  df1_in_df2
# 0  0110200_2016           0
# 1   011037_2016           1
# 2   011037_2016           1
# 3  0111054_2016           1

Upvotes: 0

rpanai
rpanai

Reputation: 13437

This is basically the same as @Ashwini's answer but you get rid of np.where and iloc which could make it more readable and eventually faster.

import pandas as pd

df1 = pd.DataFrame({"col1": ["0110200_2016", "011037_2016",
                             "011037_2016", "0111054_2016"]})

df2 = pd.DataFrame({"col1" : ["011037_2016", "0111054_2016",
                              "011109_2016", "0111268_2016"]})

df1['df1_in_df2'] = df1["col1"].isin(df2['col1'].values).astype("int8")

UPDATE

Timing

Here I try to compare 4 methods @vlemaistre, @Ashwini, @SamLegesse and mine.

import pandas as pd
import numpy as np

# create fake data
n = int(1e6)
n1 = int(1e4)

df = pd.DataFrame()
df["col1"] = ["{:012}".format(i) for i in range(n)]

df2 = df.sample(n1)
toRemove = df2.sample(n1//2).index
df1 = df[~df.index.isin(toRemove)].sample(frac=1).reset_index(drop=True)
df2 = df2.reset_index(drop=True)

# backup dataframe
df0 = df1.copy()

vlemaistre

bigdict = {"df1": df1, "df2": df2}
%%time
bigdict['df1']['df1_in_df2'] = [1 if x in bigdict['df2']['col1'].values else 0
                                for x in bigdict['df1']['col1']]
CPU times: user 4min 53s, sys: 3.08 s, total: 4min 56s
Wall time: 4min 41s

SamLegesse

def countif(x,col):
    if x in col.values:
        return 1
    else:
        return 0
    return
df1 = df0.copy()

%%time
df1['df1_in_df2'] = df1['col1'].apply(countif, col=df2['col1'])
CPU times: user 4min 48s, sys: 2.66 s, total: 4min 50s
Wall time: 4min 38s

Ashwini

df1 = df0.copy()
%%time
df1['df1_in_df2'] = np.where(df1.iloc[:,0].isin(list(df2.iloc[:,0])),1,0)
CPU times: user 167 ms, sys: 0 ns, total: 167 ms
Wall time: 165 ms

rpanai

This is perfectly on par with Ashwini's solution

df1 = df0.copy()
%%time
df1['df1_in_df2'] = df1["col1"].isin(df2['col1'].values).astype("int8")
CPU times: user 152 ms, sys: 0 ns, total: 152 ms
Wall time: 150 ms

Conclusion

The vectorial methods are (at least) 1684x faster than the one using apply.

Upvotes: 1

vlemaistre
vlemaistre

Reputation: 3331

Here is way to do it using a list comprehension :

bigdict['df1']['df1_in_df2'] = [1 if x in bigdict['df2']['col1'].values else 0
                                for x in bigdict['df1']['col1']]

Output :

           col1  df1_in_df2
0  0110200_2016           0
1   011037_2016           1
2   011037_2016           1
3  0111054_2016           1

Upvotes: 1

Ashwini
Ashwini

Reputation: 393

Try this. I unstacked your dictionary into two dataframes and compared its values.

df1 = pd.DataFrame({"col1": ["0110200_2016", "011037_2016", "011037_2016", "0111054_2016"]})
df2= pd.DataFrame({"col1" : ["011037_2016", "0111054_2016", "011109_2016", "0111268_2016"]})

df1['df1_in_df2'] = np.where(df1.iloc[:,0].isin(list(df2.iloc[:,0])),1,0)

Upvotes: 3

Related Questions