asimo
asimo

Reputation: 2500

groupby on multiple columns and filter and split into separate dataframes

I have a dataframe that looks something like :

LastName Date ObjectCol1 ObjectCol2 ObjectCol3 NumCol1 NumCol2 NumCol3 Intermediate1 Intermediate2
ABC      March                                                            NA              NA
ABC      June                                                             NA              NA
XYZ      March                                                            NA              Danger
XYZ      June                                                             Danger          Danger2
XYZ      July                                                             Danger          NA
AAA      March                                                            NA              NA
AAA      June                                                             NA              NA  
DEF      March                                                            NA              NA
DEF      June                                                             Danger          Danger2
DEF      July                                                             Danger          NA

I want to split this dataframe into two dataframes such that only those LastNames which have "always" (meaning at all dates) have had "NA" (string object) (not null) in the Intermediate1 and 2 columns will be filtered into this 1 dataframe. And the second dataframe will have those records where any of the LastNames has had "Non NA" value (atleast once in the date history) in the Intermediate* columns

So for the sample df above, rows with ABC and AAA will go into one dataframe. and rows with XYZ and DEF should go into another dataframe.

I am looking at groupby and filtering, however the date column handling isnt happening well. Can you suggest anything ?

Upvotes: 1

Views: 57

Answers (1)

jezrael
jezrael

Reputation: 863166

First get all LastName values which not matched missing values by boolean indexing with DataFrame.isna and DataFrame.all and then test original values of LastName with isin and inverted mask by ~ for df1, non inverted for df2:

vals = df.loc[~df[['Intermediate1','Intermediate2']].isna().all(axis=1),'LastName']
mask = df['LastName'].isin(vals)

Alternative solution with GroupBy.transform and GroupBy.any for test if at least one True per group:

m = ~df[['Intermediate1','Intermediate2']].isna().all(axis=1)
mask = m.groupby(df['LastName']).transform('any')
df1 = df[~mask]
df2 = df[mask]

print (df1)
  LastName   Date  ObjectCol1  ObjectCol2  ObjectCol3  NumCol1  NumCol2  \
0      ABC  March         NaN         NaN         NaN      NaN      NaN   
1      ABC   June         NaN         NaN         NaN      NaN      NaN   
5      AAA  March         NaN         NaN         NaN      NaN      NaN   
6      AAA   June         NaN         NaN         NaN      NaN      NaN   

   NumCol3 Intermediate1 Intermediate2  
0      NaN           NaN           NaN  
1      NaN           NaN           NaN  
5      NaN           NaN           NaN  
6      NaN           NaN           NaN 

print (df2)
  LastName   Date  ObjectCol1  ObjectCol2  ObjectCol3  NumCol1  NumCol2  \
2      XYZ  March         NaN         NaN         NaN      NaN      NaN   
3      XYZ   June         NaN         NaN         NaN      NaN      NaN   
4      XYZ   July         NaN         NaN         NaN      NaN      NaN   
7      DEF  March         NaN         NaN         NaN      NaN      NaN   
8      DEF   June         NaN         NaN         NaN      NaN      NaN   
9      DEF   July         NaN         NaN         NaN      NaN      NaN   

   NumCol3 Intermediate1 Intermediate2  
2      NaN           NaN        Danger  
3      NaN        Danger       Danger2  
4      NaN        Danger           NaN  
7      NaN           NaN           NaN  
8      NaN        Danger       Danger2  
9      NaN        Danger           NaN  

Upvotes: 1

Related Questions