Marc Schwambach

Reputation: 438

Extracting elements of a dataframe based on conditional of other dataframes

I believe my problem is straightforward and there must be an easy way to solve it, but as I am quite new to Python, especially pandas, I could not sort it out on my own.

I made up the following dataframes, which represent a much simpler scenario than the one I have been working on. The length of the explanation might look scary, but I am just trying to make it as clear as possible. So, please consider:

import pandas as pd

data = pd.DataFrame({'Column 1': [300, 300, 450, 500, 500, 750, 600, 300, 150],
                     'Column 2': [100, 130, 230, 200, 300, 350, 600, 550, 530],
                     'Column 3': [250, 300, 400, 500, 700, 350, 750, 550, 600]})

df_validate_1 = pd.DataFrame({'Column 1': [1, 0, 1, 1, 1, 1, 1, 1, 1],
                              'Column 2': [1, 1, 1, 1, 1, 0, 0, 1, 1],
                              'Column 3': [0, 1, 1, 1, 1, 1, 1, 1, 0]})

df_validate_2 = pd.DataFrame({'Column 1': [1, 1, 1, 1, 1, 0, 1, 1, 1],
                              'Column 2': [1, 1, 0, 1, 1, 1, 0, 0, 0],
                              'Column 3': [0, 0, 1, 1, 0, 1, 1, 0, 0]})

df_validate_3 = pd.DataFrame({'Column 1': [1, 1, 1, 0, 0, 1, 1, 1, 1],
                              'Column 2': [1, 1, 1, 1, 1, 1, 1, 0, 0],
                              'Column 3': [0, 0, 0, 1, 1, 1, 1, 1, 1]})

I have one main dataframe and three validation dataframes that contain only 1 or 0 (true or false). I want to create a new dataframe from 'data' that is conditional on the three validation dataframes: an element of 'data' is kept only if the element at the same row and column is 1 in all three validation dataframes. If any of them has a 0 at that position, the element should become NaN.

To explain briefly what I want in practice: the first element of Column 2 in 'data' is 100. In df_validate_1, df_validate_2 and df_validate_3, the first element of Column 2 is 1, so the value is passed to my new dataframe. On the other hand, the second element of Column 1 is 300, but the element at that position in df_validate_1 is 0 (even though it is 1 in df_validate_2 and df_validate_3), so I want it written as NaN.

I am expecting the following results:

  Column 1 Column 2 Column 3
0      300      100      NaN
1      NaN      130      NaN
2      450      NaN      NaN
3      NaN      200      500
4      NaN      300      NaN
5      NaN      NaN      350
6      600      NaN      750
7      300      NaN      NaN
8      150      NaN      NaN

Naturally, I am facing a far more complex problem and this is a made-up example. Hence, I am looking for an elegant solution that I can apply to a more robust scenario.

Hope that I managed to be succinct and precise. I would really appreciate your help on this one!

Upvotes: 1

Views: 251

Answers (3)

Joe

Reputation: 889

You can use np.where(), column by column:

>>> import numpy as np

>>> result_df = pd.DataFrame()

>>> result_df['Column 1'] = np.where((df_validate_1['Column 1'] == 1) & (df_validate_2['Column 1'] == 1) & (df_validate_3['Column 1'] == 1), data['Column 1'], np.nan)

>>> result_df['Column 2'] = np.where((df_validate_1['Column 2'] == 1) & (df_validate_2['Column 2'] == 1) & (df_validate_3['Column 2'] == 1), data['Column 2'], np.nan)

>>> result_df['Column 3'] = np.where((df_validate_1['Column 3'] == 1) & (df_validate_2['Column 3'] == 1) & (df_validate_3['Column 3'] == 1), data['Column 3'], np.nan)

>>> result_df

   Column 1  Column 2  Column 3
0     300.0     100.0       NaN
1       NaN     130.0       NaN
2     450.0       NaN       NaN
3       NaN     200.0     500.0
4       NaN     300.0       NaN
5       NaN       NaN     350.0
6     600.0       NaN     750.0
7     300.0       NaN       NaN
8     150.0       NaN       NaN
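
The per-column calls above could probably be collapsed into one call on the whole frames, which avoids repeating the condition for every column. A minimal sketch, assuming all four dataframes share the same index and column labels (the name `mask` is mine, not from the question):

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'Column 1': [300, 300, 450, 500, 500, 750, 600, 300, 150],
                     'Column 2': [100, 130, 230, 200, 300, 350, 600, 550, 530],
                     'Column 3': [250, 300, 400, 500, 700, 350, 750, 550, 600]})
df_validate_1 = pd.DataFrame({'Column 1': [1, 0, 1, 1, 1, 1, 1, 1, 1],
                              'Column 2': [1, 1, 1, 1, 1, 0, 0, 1, 1],
                              'Column 3': [0, 1, 1, 1, 1, 1, 1, 1, 0]})
df_validate_2 = pd.DataFrame({'Column 1': [1, 1, 1, 1, 1, 0, 1, 1, 1],
                              'Column 2': [1, 1, 0, 1, 1, 1, 0, 0, 0],
                              'Column 3': [0, 0, 1, 1, 0, 1, 1, 0, 0]})
df_validate_3 = pd.DataFrame({'Column 1': [1, 1, 1, 0, 0, 1, 1, 1, 1],
                              'Column 2': [1, 1, 1, 1, 1, 1, 1, 0, 0],
                              'Column 3': [0, 0, 0, 1, 1, 1, 1, 1, 1]})

# True only where all three validation frames hold a 1 at the same position
mask = (df_validate_1 == 1) & (df_validate_2 == 1) & (df_validate_3 == 1)

# np.where returns a plain array, so the labels are restored explicitly
result_df = pd.DataFrame(np.where(mask, data, np.nan),
                         index=data.index, columns=data.columns)
print(result_df)
```

Because NaN is a float, the resulting columns are float64, which matches the 300.0-style output shown above.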

Upvotes: 0

manwithfewneeds

Reputation: 1167

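I'm using pd.concat, then a groupby on the index to get the minimum value per position, which will be either 0 or 1. Cast to bool, this can be used as a mask on the original dataframe.

# Stack the three validation frames; each index label now appears three times
dfm = pd.concat([df_validate_1, df_validate_2, df_validate_3])
# The minimum per index label is 1 only where all three frames hold a 1
df = data[dfm.groupby(dfm.index).min().astype(bool)]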

print(df)
   Column 1  Column 2  Column 3
0     300.0     100.0       NaN
1       NaN     130.0       NaN
2     450.0       NaN       NaN
3       NaN     200.0     500.0
4       NaN     300.0       NaN
5       NaN       NaN     350.0
6     600.0       NaN     750.0
7     300.0       NaN       NaN
8     150.0       NaN       NaN
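
Since the question asks for something that scales to a more robust scenario, the same idea might be wrapped in a small helper that accepts any number of validation frames. This is only a sketch: `mask_by_validators` is a hypothetical name, the tiny frames are made up for illustration, and it assumes every validation frame has the same index and columns as the data.

```python
import pandas as pd

def mask_by_validators(data, validators):
    """Keep an element of `data` only where every validator holds a 1."""
    # Stack all validators; each index label appears once per validator
    stacked = pd.concat(validators)
    # The per-label minimum is 1 only if no validator has a 0 there
    all_valid = stacked.groupby(level=0).min().astype(bool)
    # DataFrame.where keeps values where the mask is True, NaN elsewhere
    return data.where(all_valid)

# Made-up toy frames, not the ones from the question
data = pd.DataFrame({'A': [10, 20, 30], 'B': [40, 50, 60]})
v1 = pd.DataFrame({'A': [1, 0, 1], 'B': [1, 1, 1]})
v2 = pd.DataFrame({'A': [1, 1, 1], 'B': [0, 1, 1]})

result = mask_by_validators(data, [v1, v2])
print(result)
```

Passing a list means a fourth or fifth validation frame needs no change to the function itself.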

Upvotes: 1

Trenton McKinney

Reputation: 62383

Given your DataFrames:

  • Multiply them all together
    • Leave the 0 or replace them with np.nan

import pandas as pd
import numpy as np

# A 0 in any validation frame zeroes out the corresponding element
data_updated = data * df_validate_1 * df_validate_2 * df_validate_3

# Turn the zeroed elements into NaN
data_updated.replace(0, np.nan, inplace=True)

Output:

 Column 1  Column 2  Column 3
    300.0     100.0       NaN
      NaN     130.0       NaN
    450.0       NaN       NaN
      NaN     200.0     500.0
      NaN     300.0       NaN
      NaN       NaN     350.0
    600.0       NaN     750.0
    300.0       NaN       NaN
    150.0       NaN       NaN
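
One caveat worth checking before using the product on real data: if 'data' itself can contain a legitimate 0, replacing every 0 with NaN would drop it too. The example data has no zeros, so the output above is unaffected. A small sketch with made-up values (`by_product` and `by_mask` are my names) comparing the product with a boolean mask via DataFrame.where:

```python
import numpy as np
import pandas as pd

# Made-up frames: the first data value is a genuine 0 that passes validation
data = pd.DataFrame({'Column 1': [0, 300, 450]})
valid = pd.DataFrame({'Column 1': [1, 0, 1]})

# Product approach: the valid 0 is indistinguishable from a masked element
by_product = (data * valid).replace(0, np.nan)

# Mask approach: only positions where the validator is 0 become NaN
by_mask = data.where(valid.astype(bool))

print(by_product)
print(by_mask)
```

If zeros never occur in the data, the two approaches give the same result and the product is the shorter one to write.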

Upvotes: 1
