Reputation: 65
I am performing data manipulation in Python using pandas on a very large dataset, say 100 million rows. I have two dataframes and wish to generate a third dataframe as per the conditions described below; the scenario is explained here:
dataframe 1:
Col_B and Col_D are of int64 type
Col_A Col_B Col_C Col_D
A 11 B 20
A 11 C 24
B 14 R 32
... ... ... ...
dataframe 2:
Col_Z is of type float64 and the remaining columns are of int64
Col_X Col_Y Col_P Col_Q Col_Z
10 15 16 21 0.99
10 15 17 22 0.89
... ... ... ... ...
... ... ... ... ...
11 15 30 35 0.67
... ... ... ... ...
Condition to be applied:
For the sake of understanding the conditions, consider only the first row of each dataframe:
if the value of Col_B is between the values of Col_X and Col_Y, and the value of Col_D is between the values of Col_P and Col_Q, then return the corresponding values of Col_A, Col_C and Col_Z; otherwise return NaN.
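For example, taking the first rows above: 11 lies between 10 and 15, and 20 lies between 16 and 21, so the result row is A, B, 0.99. Similarly, the third row of dataframe 1 matches the third row of dataframe 2 (11 < 14 < 15 and 30 < 32 < 35), giving B, R, 0.67.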
Expected Output (Dataframe 3):
Col_A Col_C Col_Z
A B 0.99
NaN NaN NaN
B R 0.67
Note: This output is generated considering only these three rows in each dataframe; in reality, each row of Dataframe 1 has to scan all the rows of Dataframe 2 until the desired condition is met.
My Code:
df3 = {}
Col_A = []
Col_C = []
Col_Z = []
for i in df1.iterrows():
    value = float(df2[(i[1][1] > df2['Col_X'].values) &
                      (i[1][1] < df2['Col_Y'].values) &
                      (i[1][3] > df2['Col_P'].values) &
                      (i[1][3] < df2['Col_Q'].values)]['Col_Z'])
    if bool(value):
        Col_Z.append(value)
        Col_A.append(i[1][0])
        Col_C.append(i[1][2])
    else:
        Col_Z.append(float('NaN'))
        Col_A.append(float('NaN'))
        Col_C.append(float('NaN'))
This code works fine as long as the condition is met; as soon as the condition isn't met, it throws a TypeError. Can anyone please rectify this?
Also, I wanted to know if there is an alternate, more efficient way to perform this. Please let me know.
Upvotes: 0
Views: 259
Reputation: 65
After a few trials, I was able to fix my own code. Here is the rectified version:
df3 = {}
Col_A = []
Col_C = []
Col_Z = []
for i in df1.iterrows():
    value = df2[(i[1][1] > df2['Col_X'].values) &
                (i[1][1] < df2['Col_Y'].values) &
                (i[1][3] > df2['Col_P'].values) &
                (i[1][3] < df2['Col_Q'].values)]['Col_Z']
    if value.empty:
        continue
    else:
        Col_Z.append(value)
        Col_A.append(i[1][0])
        Col_C.append(i[1][2])
df3['A'] = Col_A
df3['C'] = Col_C
df3['Z'] = Col_Z
df3 = pd.DataFrame(df3)
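The TypeError in the original version came from the float() call: when no row of df2 satisfied the mask, the selection was an empty Series, and float() cannot convert an empty Series. A minimal reproduction:

import pandas as pd
float(pd.Series([], dtype='float64'))
# TypeError: cannot convert the series to <class 'float'>

Checking value.empty first avoids that conversion. Two small caveats: value is still a one-row Series here, so appending value.iloc[0] would store a plain float in Col_Z instead; and continue skips non-matching rows entirely, so df3 will not contain the NaN rows shown in the expected output.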
However, due to the for loop iterating over all the rows, this method may not be efficient for large datasets, say 100-200 million rows.
Hoping to find a more efficient way!
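For what it's worth, here is a fully vectorized sketch of the same logic using NumPy broadcasting (my own experiment, not tested at scale; the boolean mask has shape (len(df1), len(df2)), so for 100 million rows it would have to be built over chunks of df1):

import numpy as np
import pandas as pd

# Compare every row of df1 against every row of df2 at once
b = df1['Col_B'].to_numpy()[:, None]   # shape (n1, 1)
d = df1['Col_D'].to_numpy()[:, None]
x = df2['Col_X'].to_numpy()[None, :]   # shape (1, n2)
y = df2['Col_Y'].to_numpy()[None, :]
p = df2['Col_P'].to_numpy()[None, :]
q = df2['Col_Q'].to_numpy()[None, :]

mask = (b > x) & (b < y) & (d > p) & (d < q)   # shape (n1, n2)
hit = mask.any(axis=1)        # True where a df1 row matches some df2 row
first = mask.argmax(axis=1)   # index of the first matching df2 row

df3 = pd.DataFrame({
    'Col_A': df1['Col_A'].where(hit),
    'Col_C': df1['Col_C'].where(hit),
    'Col_Z': np.where(hit, df2['Col_Z'].to_numpy()[first], np.nan),
})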
Upvotes: 0
Reputation: 5502
New answer:
You have two nested loops. The first one is always a full loop, but not the second, so the question is how to improve the efficiency of the partial inner loop.
Here, I give you two ways to perform the second loop: an explicit iteration over df2 that stops at the first match (getCondition), and a boolean mask that selects the first matching row (getCondition2).
This discussion might give you some insights on how to perform the iterations.
# Import modules
import numpy as np
import pandas as pd

df1 = pd.DataFrame([["A", 11, "B", 20],
                    ["A", 11, "C", 24],
                    ["B", 14, "R", 32]],
                   columns=["Col_A", "Col_B", "Col_C", "Col_D"])
df2 = pd.DataFrame([[10, 15, 16, 21, 0.99],
                    [10, 15, 17, 22, 0.89],
                    [11, 15, 30, 35, 0.67]],
                   columns=["Col_X", "Col_Y", "Col_P", "Col_Q", "Col_Z"])

def getCondition(row, df2):
    # Iterate over df2 until a row meets the condition
    for _, row_df2 in df2.iterrows():
        if row_df2.Col_X <= row.Col_B and row.Col_B < row_df2.Col_Y \
                and row_df2.Col_P <= row.Col_D and row.Col_D < row_df2.Col_Q:
            return pd.Series([row.Col_A, row.Col_C, row_df2.Col_Z])
    return np.nan

def getCondition2(row, df2):
    # Find all rows matching the condition and select the first one
    condition = ((df2.Col_X <= row.Col_B) & (row.Col_B < df2.Col_Y)
                 & (df2.Col_P <= row.Col_D) & (row.Col_D < df2.Col_Q))
    if condition.sum() > 0:
        return pd.Series([row.Col_A, row.Col_C, df2.Col_Z[condition].iloc[0]])
    return np.nan

# Apply the condition
output = df1.apply(getCondition2, args=[df2], axis=1)
print(output)
#      0    1     2
# 0    A    B  0.99
# 1  NaN  NaN   NaN
# 2    B    R  0.67
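getCondition2 replaces the inner Python loop with a vectorized boolean mask over df2, so it is generally the faster of the two. If you want the column names from the expected output, you can rename them afterwards, e.g.:

output.columns = ['Col_A', 'Col_C', 'Col_Z']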
Old answer:
You can do this by considering the dataset as a whole. First, join the two dataframes column-wise, using either the merge function or just concat; here, I use concat since another solution uses merge (to be clear about what they are doing, you can have a look at this). Then define the conditions, with the and operator becoming &. Finally, you can call the where function, which returns NaN where the condition isn't satisfied.
To fit the desired output, you can filter the columns using iloc or just by calling the column names.
Here is the code:
# Import module
import pandas as pd

df1 = pd.DataFrame([["A", 11, "B", 20],
                    ["A", 11, "C", 24],
                    ["B", 14, "R", 19]],
                   columns=["Col_A", "Col_B", "Col_C", "Col_D"])
df2 = pd.DataFrame([[10, 15, 16, 21, 0.99],
                    [10, 15, 17, 22, 0.89],
                    [11, 15, 16, 20, 0.67]],
                   columns=["Col_X", "Col_Y", "Col_P", "Col_Q", "Col_Z"])

# Concat the dataframes side by side
df = pd.concat([df1, df2], axis=1)
print(df)

# Define the conditions
condition_col_b = ((df.Col_X <= df.Col_B) & (df.Col_B < df.Col_Y))
condition_col_d = ((df.Col_P <= df.Col_D) & (df.Col_D < df.Col_Q))
print(condition_col_b & condition_col_d)
# 0     True
# 1    False
# 2     True

# Apply the condition
output = df.where(condition_col_b & condition_col_d)
print(output)
#   Col_A  Col_B Col_C  Col_D  Col_X  Col_Y  Col_P  Col_Q  Col_Z
# 0     A   11.0     B   20.0   10.0   15.0   16.0   21.0   0.99
# 1   NaN    NaN   NaN    NaN    NaN    NaN    NaN    NaN    NaN
# 2     B   14.0     R   19.0   11.0   15.0   16.0   20.0   0.67

# Filter the output columns
print(output[['Col_A', 'Col_C', 'Col_Z']])
#   Col_A Col_C  Col_Z
# 0     A     B   0.99
# 1   NaN   NaN    NaN
# 2     B     R   0.67
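Note that this old approach compares row i of df1 only against row i of df2 (the example data above was adjusted so the matches line up), which is why the new answer above scans all of df2 for every row of df1, as the question requires.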
Upvotes: 2