Reputation: 55
I am looking for alternate ways to replace functions used in Excel with Python, especially with Pandas. One of the functions is COUNTIFS(), which I have been primarily using to locate specific row values in a fixed range. This is mainly used to determine, whether the specific values in one column are present in the other column, or not.
An example in Excel would look something like this:
The code for the first row (column: col1_in_col2):
=COUNTIFS($B$2:$B$6,A2)
I have tried to recreate the function in Pandas, only with the difference that the two columns can be found in two different DataFrames and the DataFrames are inside a dictionary (bigdict). The code is the following:
import pandas as pd
bigdict = {"df1": pd.DataFrame({"col1": ["0110200_2016", "011037_2016", "011037_2016", "0111054_2016"]}), "df2": pd.DataFrame({"col1" : ["011037_2016", "0111054_2016", "011109_2016", "0111268_2016"]})}
bigdict.get("df1")["df1_in_df2"] = bigdict.get("df1").apply(lambda x: 1 if x["col1"] in bigdict.get("df2")["col1"] else 0, axis=1)
In the example above, the first row should get a return value of zero, while the other rows should get return values of 1, since it can be found in the other DataFrame's column. However, the return value is 0 for every row.
Upvotes: 1
Views: 752
Reputation: 712
Easiest way in my opinion would be to make a generic function that you can apply anytime you want to do the equivalent of an excel countif().
import pandas as pd
def countif(x,col):
if x in col.values:
return 1
else:
return 0
return
df1 = pd.DataFrame({"col1": ["0110200_2016", "011037_2016", "011037_2016", "0111054_2016"]})
df2 = pd.DataFrame({"col1" : ["011037_2016", "0111054_2016", "011109_2016", "0111268_2016"]})
df1['df1_in_df2'] = df1['col1'].apply(countif, col=df2['col1'])
EDIT:
As rpanai mentioned in the comments, apply is known to have performance issues as your data grows. Using numpy vectorization would provide a large performance boost. Here is a modified version of Ashwini's answer.
import pandas as pd
import numpy as np
def countif(df1, df2, col1, col2, name):
df1[name] = np.where(df1[col1].isin(list(df2[col2])),1,0)
df1 = pd.DataFrame({"col1": ["0110200_2016", "011037_2016", "011037_2016", "0111054_2016"]})
df2 = pd.DataFrame({"col1" : ["011037_2016", "0111054_2016", "011109_2016", "0111268_2016"]})
countif(df1,df2,'col1','col1','df1_in_df2')
print(df1)
# col1 df1_in_df2
# 0 0110200_2016 0
# 1 011037_2016 1
# 2 011037_2016 1
# 3 0111054_2016 1
Upvotes: 0
Reputation: 13437
This is basically the same as @Ashwini's answer but you get rid of np.where
and iloc
which could make it more readable and eventually faster.
import pandas as pd
df1 = pd.DataFrame({"col1": ["0110200_2016", "011037_2016",
"011037_2016", "0111054_2016"]})
df2 = pd.DataFrame({"col1" : ["011037_2016", "0111054_2016",
"011109_2016", "0111268_2016"]})
df1['df1_in_df2'] = df1["col1"].isin(df2['col1'].values).astype("int8")
UPDATE
Here I try to compare 4 methods @vlemaistre, @Ashwini, @SamLegesse and mine.
import pandas as pd
import numpy as np
# create fake data
n = int(1e6)
n1 = int(1e4)
df = pd.DataFrame()
df["col1"] = ["{:012}".format(i) for i in range(n)]
df2 = df.sample(n1)
toRemove = df2.sample(n1//2).index
df1 = df[~df.index.isin(toRemove)].sample(frac=1).reset_index(drop=True)
df2 = df2.reset_index(drop=True)
# backup dataframe
df0 = df1.copy()
bigdict = {"df1": df1, "df2": df2}
%%time
bigdict['df1']['df1_in_df2'] = [1 if x in bigdict['df2']['col1'].values else 0
for x in bigdict['df1']['col1']]
CPU times: user 4min 53s, sys: 3.08 s, total: 4min 56s
Wall time: 4min 41s
def countif(x,col):
if x in col.values:
return 1
else:
return 0
return
df1 = df0.copy()
%%time
df1['df1_in_df2'] = df1['col1'].apply(countif, col=df2['col1'])
CPU times: user 4min 48s, sys: 2.66 s, total: 4min 50s
Wall time: 4min 38s
df1 = df0.copy()
%%time
df1['df1_in_df2'] = np.where(df1.iloc[:,0].isin(list(df2.iloc[:,0])),1,0)
CPU times: user 167 ms, sys: 0 ns, total: 167 ms
Wall time: 165 ms
This is perfectly on par with Ashwini's solution
df1 = df0.copy()
%%time
df1['df1_in_df2'] = df1["col1"].isin(df2['col1'].values).astype("int8")
CPU times: user 152 ms, sys: 0 ns, total: 152 ms
Wall time: 150 ms
The vectorial methods are (at least) 1684x faster than the one using apply
.
Upvotes: 1
Reputation: 3331
Here is way to do it using a list comprehension :
bigdict['df1']['df1_in_df2'] = [1 if x in bigdict['df2']['col1'].values else 0
for x in bigdict['df1']['col1']]
Output :
col1 df1_in_df2
0 0110200_2016 0
1 011037_2016 1
2 011037_2016 1
3 0111054_2016 1
Upvotes: 1
Reputation: 393
Try this. I unstacked your dictionary into two dataframes and compared its values.
df1 = pd.DataFrame({"col1": ["0110200_2016", "011037_2016", "011037_2016", "0111054_2016"]})
df2= pd.DataFrame({"col1" : ["011037_2016", "0111054_2016", "011109_2016", "0111268_2016"]})
df1['df1_in_df2'] = np.where(df1.iloc[:,0].isin(list(df2.iloc[:,0])),1,0)
Upvotes: 3