Rawss24

Reputation: 65

How to generate a new dataframe in python by considering conditions from other dataframes?

I am performing data manipulation in Python using pandas on a very large dataset, say 100 million rows. I have two dataframes and wish to generate a third dataframe according to the conditions described in the scenario below:

dataframe 1:
Col_B and Col_D are of int64 type

Col_A   Col_B   Col_C   Col_D
 A       11      B       20
 A       11      C       24
 B       14      R       32
...      ...    ...      ...

dataframe 2:
Col_Z is of type float64 and the remaining columns are of int64

Col_X   Col_Y   Col_P   Col_Q   Col_Z
 10      15      16      21      0.99
 10      15      17      22      0.89
...     ...     ...     ...      ...
...     ...     ...     ...      ...
 11      15      30      35      0.67
...     ...     ...     ...      ...

Condition to be applied (consider only the first row of both dataframes, for the sake of understanding the condition):

if the value of Col_B is between the values of Col_X and Col_Y, and the value of Col_D is between the values of Col_P and Col_Q, then return the corresponding values of Col_A, Col_C and Col_Z; otherwise return NaN

Expected Output (Dataframe 3):

Col_A   Col_C   Col_Z
 A       B       0.99
NaN     NaN      NaN
 B       R       0.67

Note: This output is generated assuming there are only these three rows in the dataframes; in reality, each row of Dataframe 1 has to scan all of the rows of Dataframe 2 until the desired condition is met.

My Code:

df3 = {}
Col_A = []
Col_C = []
Col_Z = []
for i in df1.iterrows():    
    value = float(df2[(i[1][1] > df2['Col_X'].values) &
      (i[1][1] < df2['Col_Y'].values) &
      (i[1][3] > df2['Col_P'].values) &
      (i[1][3] < df2['Col_Q'].values)]['Col_Z'])

    if bool(value):
        Col_Z.append(value)
        Col_A.append(i[1][0])
        Col_C.append(i[1][2])
    else:
        Col_Z.append(float('NaN'))
        Col_A.append(float('NaN'))
        Col_C.append(float('NaN'))

This code works fine as long as the condition is met; as soon as the condition isn't met, it throws a TypeError. Can anyone please rectify this?
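For reference, I assume the error comes from the float(...) call: when no row of df2 satisfies the condition, the selection is an empty Series, and float() cannot convert an empty Series (the same happens if more than one row matches). A minimal reproduction with the sample data above:

float(df2[df2['Col_X'] > 100]['Col_Z'])   # empty selection
# TypeError: cannot convert the series to <class 'float'>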

Also, I would like to know if there is an alternative, more efficient way to perform this. Please let me know.

Upvotes: 0

Views: 259

Answers (2)

Rawss24

Reputation: 65

After a few trials, I was able to fix my own code. Here is the rectified version:

df3 = {}
Col_A = []
Col_C = []
Col_Z = []
for _, row in df1.iterrows():
    # Select the Col_Z values of all df2 rows that satisfy the condition
    value = df2[(row['Col_B'] > df2['Col_X'].values) &
                (row['Col_B'] < df2['Col_Y'].values) &
                (row['Col_D'] > df2['Col_P'].values) &
                (row['Col_D'] < df2['Col_Q'].values)]['Col_Z']

    if value.empty:
        # No match in df2: skip this row of df1
        continue
    else:
        Col_Z.append(value.iloc[0])   # keep the first match as a scalar
        Col_A.append(row['Col_A'])
        Col_C.append(row['Col_C'])

df3['A'] = Col_A
df3['C'] = Col_C
df3['Z'] = Col_Z
df3 = pd.DataFrame(df3)

However, because the for loop iterates over every row, this method may not be efficient for large datasets, say 100-200 million rows. Hoping to find a more efficient way!
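One possible alternative (just a sketch, using the column names from the question and a hypothetical helper called first_match): drop the Python-level loop and broadcast the comparison with NumPy. This builds a len(df1) x len(df2) boolean matrix, so for a very large df1 it would need to be run in chunks, but each chunk is fully vectorized:

# Import modules
import numpy as np
import pandas as pd

def first_match(df1, df2):
    # Compare every row of df1 against every row of df2 at once.
    # cond has shape (len(df1), len(df2)); this matrix is the memory
    # bottleneck, so a huge df1 should be processed in chunks.
    b = df1['Col_B'].to_numpy()[:, None]
    d = df1['Col_D'].to_numpy()[:, None]
    cond = ((b > df2['Col_X'].to_numpy()) & (b < df2['Col_Y'].to_numpy()) &
            (d > df2['Col_P'].to_numpy()) & (d < df2['Col_Q'].to_numpy()))

    has_match = cond.any(axis=1)      # does any df2 row satisfy the condition?
    first_idx = cond.argmax(axis=1)   # position of the first matching df2 row (0 if none)

    return pd.DataFrame({
        'Col_A': df1['Col_A'].where(has_match),
        'Col_C': df1['Col_C'].where(has_match),
        'Col_Z': np.where(has_match, df2['Col_Z'].to_numpy()[first_idx], np.nan),
    }, index=df1.index)

df3 = first_match(df1, df2)

With the three sample rows above this returns the expected output, including a NaN row where no row of df2 matches.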

Upvotes: 0

Alexandre B.

Reputation: 5502

New answer:

You effectively have two nested loops. The outer loop always runs over the whole of df1, but the inner one does not have to: it can stop as soon as a match is found. So the question is how to make that inner, partial loop efficient.

Here are two ways to perform the inner loop:

  • The first considers the dataset as a whole, processing all the data and selecting the interesting values.
  • The second iterates over the rows until the condition is matched.

This discussion might give you some insights on how to perform the iterations.

# Import modules
import numpy as np
import pandas as pd

df1 = pd.DataFrame([["A", 11,  "B", 20],
                    ["A", 11,  "C", 24],
                    ["B", 14,  "R", 32]],
                   columns=["Col_A", "Col_B", "Col_C", "Col_D"])
df2 = pd.DataFrame([[10, 15,  16, 21, 0.99],
                    [10, 15,  17, 22, 0.89],
                    [11, 15,  30, 35, 0.67]],
                   columns=["Col_X", "Col_Y", "Col_P", "Col_Q", "Col_Z"])

def getCondition(row, df2):
    # Iterate df2 till a row meets the condition
    for _, row_df2 in df2.iterrows():
        if row_df2.Col_X <= row.Col_B and row.Col_B < row_df2.Col_Y \
            and row_df2.Col_P <= row.Col_D and row.Col_D < row_df2.Col_Q:
            return pd.Series([row.Col_A, row.Col_C, row_df2.Col_Z])
    return np.NaN


def getCondition2(row, df2):
    # Find all rows matching the condition and select the first
    condition = ((df2.Col_X <= row.Col_B) & (row.Col_B < df2.Col_Y)\
        & (df2.Col_P <= row.Col_D) & (row.Col_D < df2.Col_Q))
    if sum(condition) > 0:
        return pd.Series([row.Col_A, row.Col_C, df2.Col_Z[condition].iloc[0]])
    return np.NaN


# Apply the condition
output = df1.apply(getCondition2, args=[df2], axis=1)
print(output)
#      0    1     2
# 0    A    B  0.99
# 1  NaN  NaN   NaN
# 2    B    R  0.67
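If you want the columns to carry the original names rather than 0, 1, 2, you can rename the result:

output.columns = ["Col_A", "Col_C", "Col_Z"]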

Old answer:

You can do this by considering the data set as a whole.

  • First, for convenience, I suggest joining your two dataframes into one. You can do this with the merge function or just concat. Here I use concat, since another solution uses merge. To be clear about what they are doing, you can have a look at this.
  • Then you can define your condition on whole columns. Take care that the and operator becomes &.
  • Finally, you can call the where function, which returns NaN wherever the condition isn't satisfied.

  • To fit the desired output, you can filter the columns using iloc or just by calling the column names.

Here is the code:

# Import module
import pandas as pd

df1 = pd.DataFrame([["A", 11,  "B", 20],
                    ["A", 11,  "C", 24],
                    ["B", 14,  "R", 19]],
                   columns=["Col_A", "Col_B", "Col_C", "Col_D"])
df2 = pd.DataFrame([[10, 15,  16, 21, 0.99],
                    [10, 15,  17, 22, 0.89],
                    [11, 15,  16, 20, 0.67]],
                   columns=["Col_X", "Col_Y", "Col_P", "Col_Q", "Col_Z"])

# Concat the dataframe
df = pd.concat([df1, df2], axis=1)
print(df)

# Define the conditions
condition_col_b = ((df.Col_X <= df.Col_B) & (df.Col_B < df.Col_Y))
condition_col_d = ((df.Col_P <= df.Col_D) & (df.Col_D < df.Col_Q))

print(condition_col_b & condition_col_d)
# 0     True
# 1    False
# 2     True

# Apply the condition
output = df.where(condition_col_b & condition_col_d)
print(output)
#   Col_A  Col_B Col_C  Col_D  Col_X  Col_Y  Col_P  Col_Q  Col_Z
# 0     A   11.0     B   20.0   10.0   15.0   16.0   21.0   0.99
# 1   NaN    NaN   NaN    NaN    NaN    NaN    NaN    NaN    NaN
# 2     B   14.0     R   19.0   11.0   15.0   16.0   20.0   0.67

# Filter output
print(output[['Col_A', 'Col_C', 'Col_Z']])
#   Col_A Col_C  Col_Z
# 0     A     B   0.99
# 1   NaN   NaN    NaN
# 2     B     R   0.67

Upvotes: 2
