Mop
Mop

Reputation: 87

Delete rows with conditions (Multi-Index case)

I'm new to Stack Overflow and I have this Data Set :

df=pd.DataFrame({'ID': {0: 4, 1: 4, 2: 4, 3: 88, 4: 88, 5: 323, 6: 323},
         'Step': {0: 'A', 1: 'Bar', 2: 'F', 3: 'F', 4: 'Bar', 5: 'F', 6: 'A'},
         'Num': {0: 38, 1: 38, 2: 38, 3: 320, 4: 320, 5: 433, 6: 432},
         'Date': {0: '2018-08-02',
          1: '2018-12-02',
          2: '2019-03-02',
          3: '2017-03-02',
          4: '2018-03-02',
          5: '2020-03-04',
          6: '2020-02-03'},
         'Occurence': {0: 3, 1: 3, 2: 3, 3: 2, 4: 2, 5: 2, 6: 2}})

The variables 'ID' and 'Step' are Multi-index.

I would like to do two things :

FIRST :

If 'Num' is different for the same 'ID', then delete the rows of this ID.

SECONDLY :

For a same ID, the step 'F' should be the last one (with the most recent date). If not, then delete the rows of this ID.

I have some difficulties because the commands df['Step'] and df['ID'] are NOT WORKING ('ID' and 'Step' are Multi-Index cause of a recent groupby() ).

I've tried groupby(level=0) that I found on Multi index dataframe delete row with maximum value per group

But I still have some difficulties.

Could someone please help me?

Expected Output :

df=pd.DataFrame({'ID': {0: 4, 1: 4, 2: 4},
         'Step': {0: 'A', 1: 'Bar', 2: 'F'},
         'Num': {0: 38, 1: 38, 2: 38},
         'Date': {0: '2018-08-02',
          1: '2018-12-02',
          2: '2019-03-02',
         'Occurence': {0: 3, 1: 3, 2: 3}})

The ID 88 has been removed because the step 'F' was not the last one step (with the most recent date). The ID 323 has been removed because Num 433!=Num 432.

Upvotes: 3

Views: 253

Answers (4)

use groupby to find the rows with 1 occurrence. I drop the rows in the dataframe based on the ID return by the groupby results. I exclude IDs with one occurrence and not include those in the deletion.

df=pd.DataFrame({'ID': {0: 4, 1: 4, 2: 4, 3: 88, 4: 88, 5: 323, 6: 323},
     'Step': {0: 'A', 1: 'Bar', 2: 'F', 3: 'F', 4: 'Bar', 5: 'F', 6: 'A'},
     'Num': {0: 38, 1: 38, 2: 38, 3: 320, 4: 320, 5: 433, 6: 432},
     'Date': {0: '2018-08-02',
      1: '2018-12-02',
      2: '2019-03-02',
      3: '2017-03-02',
      4: '2018-03-02',
      5: '2020-03-04',
      6: '2020-02-03'},
     'Occurence': {0: 3, 1: 3, 2: 3, 3: 2, 4: 2, 5: 2, 6: 2}})
df.set_index(['ID','Step'],inplace=True)
print(df)
print("If 'Num' is different for the same 'ID', then delete the rows of this ID.")

 #exclude id with single occurrences
 grouped=df.groupby([df.index.get_level_values(0)]).size().eq(1)
 labels=set([x for x,y in (grouped[grouped.values==True].index)])

 filter=[x for x in df.index.get_level_values(0) if x not in labels]

 grouped = df[df.index.get_level_values(0).isin(filter)].groupby([df.index.get_level_values(0),'Num']).size().eq(1)

 labels=set([x for x,y in (grouped[grouped.values==True].index)])
 if len(labels)>0:
     df = df.drop(labels=labels, axis=0,level=0)    
 print(df)

output:

  Num        Date  Occurence
  ID Step                            
  4  A      38  2018-08-02          3
     Bar    38  2018-12-02          3
     F      38  2019-03-02          3
  88 F     320  2017-03-02          2
     Bar   320  2018-03-02          2

Upvotes: 0

Scott Boston
Scott Boston

Reputation: 153460

Since you stated that ID and Step are in the index, we can do it this way:

df1[df1.sort_values('Date').groupby('ID')['Num']\
       .transform(lambda x: (x.nunique() == 1) & 
                            (x.index.get_level_values(1)[-1] == 'F'))]

Output:

         Num        Date  Occurence
ID Step                            
4  A      38  2018-08-02          3
   Bar    38  2018-12-02          3
   F      38  2019-03-02          3

How?

  • First sort the dataframe by 'Date'
  • Then group the dataframe by ID
  • Taking each group of the dataframe and using the 'Num' column to transform in a boolean series, we first get the number of unique elements of 'Num' in that group, if that number is equal to 1, then you know that in that group all 'Num's are the same and that is True
  • Secondly, and we get the inner level of the MultiIndex (level=1) and we check the last value using indexing with [-1], if that value is equal to 'F' then have a True also

Upvotes: 2

Shubham Sharma
Shubham Sharma

Reputation: 71689

  • Group the dataframe by column ID
  • Transform the Num column using nunique to identify the unique values
  • Transform the Step column using last to check whether the last value per group is F
  • Combine the boolean masks using logical and and filter the rows
g = df.groupby('ID')
m = g['Num'].transform('nunique').eq(1) & g['Step'].transform('last').eq('F')
print(df[m])

   ID Step  Num        Date  Occurence
0   4    A   38  2018-08-02          3
1   4  Bar   38  2018-12-02          3
2   4    F   38  2019-03-02          3

Alternative approach with groupby and filter but could be less efficient than the above approach

df.groupby('ID').filter(lambda g: g['Step'].iloc[-1] == 'F' and g['Num'].nunique() == 1)

   ID Step  Num        Date  Occurence
0   4    A   38  2018-08-02          3
1   4  Bar   38  2018-12-02          3
2   4    F   38  2019-03-02          3

Note: In case ID and Step are MultiIndex you have to reset the index before using the above proposed solutions.

Upvotes: 2

TJVerne
TJVerne

Reputation: 75

I don't know if I understood correctly. But you can try this

import os
import pandas as pd 

sheet = pd.read_excel(io="you_file", sheet_name='sheet_name', na_filter=False, header=0 )

list_objects          = []

for index,row in sheet.iterrows():
    if (row['ID'] != index):
        list_objects.append(row)

list_objects will be a list of dict

Upvotes: 0

Related Questions