Reputation: 65
I am performing data manipulation in Python using pandas on a very large dataset, say 100 million rows. I have two dataframes and wish to generate a third dataframe as per the conditions described below; the scenario is explained here:
dataframe 1:
Col_B and Col_D are of int64 type
Col_A Col_B Col_C Col_D
A 11 B 20
A 11 C 24
B 14 R 32
... ... ... ...
dataframe 2:
Col_Z is of type float64 and the remaining columns are of int64
Col_X Col_Y Col_P Col_Q Col_Z
10 15 16 21 0.99
10 15 17 22 0.89
... ... ... ... ...
... ... ... ... ...
11 15 30 35 0.67
... ... ... ... ...
Condition to be applied:
For the sake of understanding the conditions, consider only the first row of each dataframe:
if the value of Col_B is between the values of Col_X and Col_Y, and the value of Col_D is between the values of Col_P and Col_Q, then return the corresponding values of Col_A, Col_C and Col_Z; otherwise return NaN.
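For example, taking the first rows above: 11 lies between 10 and 15, and 20 lies between 16 and 21, so the result row is A, B, 0.99. Similarly, the third row of dataframe 1 matches the third row of dataframe 2 (11 < 14 < 15 and 30 < 32 < 35), giving B, R, 0.67.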
Expected Output (Dataframe 3):
Col_A Col_C Col_Z
A B 0.99
NaN NaN NaN
B R 0.67
Note: This output is generated considering only these three rows in each dataframe; in reality, each row of Dataframe 1 has to scan all the rows of Dataframe 2 until the desired condition is met.
My Code:
df3 = {}
Col_A = []
Col_C = []
Col_Z = []
for i in df1.iterrows():
    value = float(df2[(i[1][1] > df2['Col_X'].values) &
                      (i[1][1] < df2['Col_Y'].values) &
                      (i[1][3] > df2['Col_P'].values) &
                      (i[1][3] < df2['Col_Q'].values)]['Col_Z'])
    if bool(value):
        Col_Z.append(value)
        Col_A.append(i[1][0])
        Col_C.append(i[1][2])
    else:
        Col_Z.append(float('NaN'))
        Col_A.append(float('NaN'))
        Col_C.append(float('NaN'))
This code works fine as long as the condition is met; as soon as the condition isn't met, it throws a TypeError. Can anyone please rectify this?
Also, I wanted to know if there is an alternate, more efficient way to perform this. Please let me know.
Upvotes: 0
Views: 259
Reputation: 65
After a few trials, I was able to fix my own code. Here is the rectified version:
df3 = {}
Col_A = []
Col_C = []
Col_Z = []
for i in df1.iterrows():
    value = df2[(i[1][1] > df2['Col_X'].values) &
                (i[1][1] < df2['Col_Y'].values) &
                (i[1][3] > df2['Col_P'].values) &
                (i[1][3] < df2['Col_Q'].values)]['Col_Z']
    if value.empty:
        continue
    else:
        Col_Z.append(value)
        Col_A.append(i[1][0])
        Col_C.append(i[1][2])
df3['A'] = Col_A
df3['C'] = Col_C
df3['Z'] = Col_Z
df3 = pd.DataFrame(df3)
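The TypeError in the original version came from the float() call: when no row of df2 satisfied the mask, the selection was an empty Series, and float() cannot convert an empty Series. A minimal reproduction:

import pandas as pd
float(pd.Series([], dtype='float64'))
# TypeError: cannot convert the series to <class 'float'>

Checking value.empty first avoids that conversion. Two small caveats: value is still a one-row Series here, so appending value.iloc[0] would store a plain float in Col_Z instead; and continue skips non-matching rows entirely, so df3 will not contain the NaN rows shown in the expected output.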
However, due to the for loop iterating over all the rows, this method may not be efficient for large datasets, say 100-200 million rows.
Hoping to find a more efficient way!
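For what it's worth, here is a fully vectorized sketch of the same logic using NumPy broadcasting (my own experiment, not tested at scale; the boolean mask has shape (len(df1), len(df2)), so for 100 million rows it would have to be built over chunks of df1):

import numpy as np
import pandas as pd

# Compare every row of df1 against every row of df2 at once
b = df1['Col_B'].to_numpy()[:, None]   # shape (n1, 1)
d = df1['Col_D'].to_numpy()[:, None]
x = df2['Col_X'].to_numpy()[None, :]   # shape (1, n2)
y = df2['Col_Y'].to_numpy()[None, :]
p = df2['Col_P'].to_numpy()[None, :]
q = df2['Col_Q'].to_numpy()[None, :]

mask = (b > x) & (b < y) & (d > p) & (d < q)   # shape (n1, n2)
hit = mask.any(axis=1)        # True where a df1 row matches some df2 row
first = mask.argmax(axis=1)   # index of the first matching df2 row

df3 = pd.DataFrame({
    'Col_A': df1['Col_A'].where(hit),
    'Col_C': df1['Col_C'].where(hit),
    'Col_Z': np.where(hit, df2['Col_Z'].to_numpy()[first], np.nan),
})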
Upvotes: 0
Reputation: 5502
New answer:
You have two nested loops. The first one is always a full loop, but not the second, so the question is how to improve the efficiency of the partial inner loop.
Here, I give you two ways to perform the second loop: an explicit iteration over df2 that stops at the first match (getCondition), and a boolean mask that selects the first matching row (getCondition2).
This discussion might give you some insights on how to perform the iterations.
# Import modules
import numpy as np
import pandas as pd

df1 = pd.DataFrame([["A", 11, "B", 20],
                    ["A", 11, "C", 24],
                    ["B", 14, "R", 32]],
                   columns=["Col_A", "Col_B", "Col_C", "Col_D"])
df2 = pd.DataFrame([[10, 15, 16, 21, 0.99],
                    [10, 15, 17, 22, 0.89],
                    [11, 15, 30, 35, 0.67]],
                   columns=["Col_X", "Col_Y", "Col_P", "Col_Q", "Col_Z"])

def getCondition(row, df2):
    # Iterate over df2 until a row meets the condition
    for _, row_df2 in df2.iterrows():
        if row_df2.Col_X <= row.Col_B and row.Col_B < row_df2.Col_Y \
                and row_df2.Col_P <= row.Col_D and row.Col_D < row_df2.Col_Q:
            return pd.Series([row.Col_A, row.Col_C, row_df2.Col_Z])
    return np.nan

def getCondition2(row, df2):
    # Find all rows matching the condition and select the first one
    condition = ((df2.Col_X <= row.Col_B) & (row.Col_B < df2.Col_Y)
                 & (df2.Col_P <= row.Col_D) & (row.Col_D < df2.Col_Q))
    if condition.sum() > 0:
        return pd.Series([row.Col_A, row.Col_C, df2.Col_Z[condition].iloc[0]])
    return np.nan

# Apply the condition
output = df1.apply(getCondition2, args=[df2], axis=1)
print(output)
#      0    1     2
# 0    A    B  0.99
# 1  NaN  NaN   NaN
# 2    B    R  0.67
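getCondition2 replaces the inner Python loop with a vectorized boolean mask over df2, so it is generally the faster of the two. If you want the column names from the expected output, you can rename them afterwards, e.g.:

output.columns = ['Col_A', 'Col_C', 'Col_Z']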
Old answer:
You can do this by considering the dataset as a whole. First, join the two dataframes column-wise, using either the merge function or just concat; here, I use concat since another solution uses merge (to be clear about what they are doing, you can have a look at this). Then define the conditions, with the and operator becoming &. Finally, you can call the where function, which returns NaN where the condition isn't satisfied.
To fit the desired output, you can filter the columns using iloc or just by calling the column names.
Here is the code:
# Import module
import pandas as pd

df1 = pd.DataFrame([["A", 11, "B", 20],
                    ["A", 11, "C", 24],
                    ["B", 14, "R", 19]],
                   columns=["Col_A", "Col_B", "Col_C", "Col_D"])
df2 = pd.DataFrame([[10, 15, 16, 21, 0.99],
                    [10, 15, 17, 22, 0.89],
                    [11, 15, 16, 20, 0.67]],
                   columns=["Col_X", "Col_Y", "Col_P", "Col_Q", "Col_Z"])

# Concat the dataframes side by side
df = pd.concat([df1, df2], axis=1)
print(df)

# Define the conditions
condition_col_b = ((df.Col_X <= df.Col_B) & (df.Col_B < df.Col_Y))
condition_col_d = ((df.Col_P <= df.Col_D) & (df.Col_D < df.Col_Q))
print(condition_col_b & condition_col_d)
# 0     True
# 1    False
# 2     True

# Apply the condition
output = df.where(condition_col_b & condition_col_d)
print(output)
#   Col_A  Col_B Col_C  Col_D  Col_X  Col_Y  Col_P  Col_Q  Col_Z
# 0     A   11.0     B   20.0   10.0   15.0   16.0   21.0   0.99
# 1   NaN    NaN   NaN    NaN    NaN    NaN    NaN    NaN    NaN
# 2     B   14.0     R   19.0   11.0   15.0   16.0   20.0   0.67

# Filter the output columns
print(output[['Col_A', 'Col_C', 'Col_Z']])
#   Col_A Col_C  Col_Z
# 0     A     B   0.99
# 1   NaN   NaN    NaN
# 2     B     R   0.67
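Note that this old approach compares row i of df1 only against row i of df2 (the example data above was adjusted so the matches line up), which is why the new answer above scans all of df2 for every row of df1, as the question requires.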
Upvotes: 2