Pingpong

Reputation: 8009

Select data from two dataframes into two dataframes with Pandas DataFrame and Python

I have two dataframes, shown below.

rows_df

  Id Location Age
0  0           30
1  1       US  20
2  2       

requiredCols_df

  RequiredColumn
0       Location

requiredCols_df specifies which column is required in rows_df. In this example, Location is required, Age is optional in rows_df.

I want to filter the rows_df based on the requiredCols_df. So there will be two resulting dataframes. One contains rows that have the required columns, and the other dataframe contains rows that don't have any required columns.

Expected Result

Rows matched

  Id Location Age
1  1       US  20

Rows don't match

  Id Location Age
0  0           30
2  2

Please note that:

1. The rows_df contains more than two columns, e.g. 10-30 columns.

2. The requiredCols_df may contain more than one row.

3. Location contains ' ' (one or more whitespace characters) in row 0 and a null (None) in row 2.

rows_df = pd.DataFrame({'Id':['0','1','2'],
                    'Location': [' ', 'US', None], 
                        'Age':['30','20','']})

# The column names below specify which columns in rows_df must be non-empty

requiredCols_df = pd.DataFrame([['Location']],
                            columns= ['RequiredColumn'])

4. Update: both resulting DataFrames contain the original, unchanged values.

I can do this with a loop, but I want to see if there is a better solution.
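For reference, the loop approach I have in mind looks roughly like this (a sketch; the `is_empty` helper is just one way to treat None, NaN, and whitespace-only strings as "empty"):

```python
import pandas as pd

rows_df = pd.DataFrame({'Id': ['0', '1', '2'],
                        'Location': [' ', 'US', None],
                        'Age': ['30', '20', '']})
requiredCols_df = pd.DataFrame([['Location']], columns=['RequiredColumn'])

def is_empty(value):
    # Missing (None/NaN) or a string that is blank after stripping whitespace.
    return pd.isna(value) or str(value).strip() == ''

# Keep a row in "matched" only if every required column is non-empty.
mask = rows_df.apply(
    lambda row: all(not is_empty(row[c])
                    for c in requiredCols_df['RequiredColumn']),
    axis=1)

matched = rows_df[mask]        # rows with all required columns filled
not_matched = rows_df[~mask]   # rows missing at least one required column
```

Both slices keep the original, unchanged values, since no cells are rewritten.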

Upvotes: 2

Views: 57

Answers (3)

BeRT2me

Reputation: 13242

If you correct those to be true nans...

import numpy as np

req_cols = requiredCols_df.RequiredColumn
rows_df[req_cols] = rows_df[req_cols].replace([r'^\s*$', '', None], np.nan, regex=True)

  Id Location Age
0  0      NaN  30
1  1       US  20
2  2      NaN

Then this is simple:

matched = rows_df.dropna(subset=req_cols)
not_matched = rows_df[~rows_df.eq(matched).all(axis=1)]
print(matched)
print(not_matched)

# Output:
  Id Location Age
1  1       US  20

  Id Location Age
0  0      NaN  30
2  2      NaN

Generalized:

rows_df = rows_df.replace([r'^\s*$', '', None], np.nan, regex=True)
matched = rows_df.dropna(subset=requiredCols_df.RequiredColumn)
not_matched = rows_df[~rows_df.isin(matched).all(axis=1)]
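Put together with the question's setup, the generalized approach runs end-to-end like this (a sketch; the last line is an index-based variant of the selection, and the regex replace alone is enough here since `dropna` already treats None as missing):

```python
import numpy as np
import pandas as pd

rows_df = pd.DataFrame({'Id': ['0', '1', '2'],
                        'Location': [' ', 'US', None],
                        'Age': ['30', '20', '']})
requiredCols_df = pd.DataFrame([['Location']], columns=['RequiredColumn'])

# Normalize empty / whitespace-only strings to NaN across the frame.
rows_df = rows_df.replace(r'^\s*$', np.nan, regex=True)

# Drop rows where any required column is missing.
matched = rows_df.dropna(subset=requiredCols_df.RequiredColumn)
not_matched = rows_df[~rows_df.index.isin(matched.index)]
```

Selecting by index avoids comparing cell values, so rows containing NaN in non-required columns are handled without special cases.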

Upvotes: 1

ringo

Reputation: 1178

With this setup:

import numpy as np
import pandas as pd


rows_df = pd.DataFrame(
    {
        "Id": [0, 1, 2],
        "Location": [np.nan, "US", np.nan],
        "Age": [30, 20, np.nan]
    }
)

requiredCols_df = pd.DataFrame(
    {
        "RequiredColumn": ["Location"]
    }
)

We can easily form a boolean series to select what we want:

required_column_not_null = rows_df[requiredCols_df["RequiredColumn"]].notnull().all(axis=1)

rows_df[required_column_not_null]
   Id Location   Age
1   1       US  20.0

required_column_null = ~required_column_not_null

rows_df[required_column_null]
   Id Location   Age
0   0      NaN  30.0
2   2      NaN   NaN

Upvotes: 0

ti7

Reputation: 18836

Assuming you don't have a tremendous number of columns, this seems like a task for a loop:

df_result = rows_df
for column in requiredCols_df["RequiredColumn"]:
    df_result = df_result[df_result[column].notnull()]
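The loop only keeps the matching rows; the second DataFrame the question asks for can be recovered from the leftover index (a sketch, assuming empty cells are already NaN as in the other answers):

```python
import numpy as np
import pandas as pd

rows_df = pd.DataFrame({'Id': [0, 1, 2],
                        'Location': [np.nan, 'US', np.nan],
                        'Age': [30, 20, np.nan]})
requiredCols_df = pd.DataFrame({'RequiredColumn': ['Location']})

# The loop from the answer: drop rows with a missing required column.
df_result = rows_df
for column in requiredCols_df['RequiredColumn']:
    df_result = df_result[df_result[column].notnull()]

# Rows not selected by the loop form the "don't match" set.
df_rest = rows_df.loc[rows_df.index.difference(df_result.index)]
```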

Upvotes: 0
