Dogod
Dogod

Reputation: 303

How to split dataframe or reorder dataframe by rows in pandas

I just want to clean the dataframe and analyse the dataframe. However, I got in trouble. I created a simple dataframe to illustrate it:

import pandas as pd
d = {'Resutls': ['IIL', 'pass','pass','IIH','pass','IIL','pass'], 'part':['None',1,2,'None',5,'None',4] }
df = pd.DataFrame(d)

the result looks like:

    Resutls  part
0     IIL    None
1    pass      1
2    pass      2
3     IIH    None
4    pass      5
5     IIL    None
6    pass      4

There are some repeatable modules in the dataframe. I just want to reorder the dataframe by rows and drop the duplicated ones like:

    Resutls  part
0     IIL    None
1    pass      1
2    pass      2
6    pass      4 
3     IIH    None
4    pass      5

or just split the dataframe into several sub dataframes:

    Resutls  part
0     IIL    None
1    pass      1
2    pass      2
3    pass      4 

    Resutls  part
0     IIH    None
1    pass      5

This is just an easy example what I want to do. Actually I have a 4000-thousand rows dataframe, I tried to use reindex or df.iloc to do this. It is intuitive for me but seems a little complicated to achieve. Is there any good way to do this? Please advise.

Upvotes: 1

Views: 107

Answers (1)

jezrael
jezrael

Reputation: 862921

I think you need replace pass to NaNs and use forward filling, then sort by argsort and reorder by iloc:

df = df.iloc[df['Resutls'].mask(df['Resutls'].eq('pass')).ffill().argsort()]
print (df)
  Resutls  part
3     IIH  None
4    pass     5
0     IIL  None
1    pass     1
2    pass     2
5     IIL  None
6    pass     4

Last remove repeating rows by boolean indexing:

df = df[~df['Resutls'].duplicated() | (df['Resutls'] == 'pass')]
print (df)
  Resutls  part
3     IIH  None
4    pass     5
0     IIL  None
1    pass     1
2    pass     2
6    pass     4

If want each DataFrame separately:

df['g'] = df['Resutls'].mask(df['Resutls'].eq('pass')).ffill()
df = df[~df['Resutls'].duplicated() | (df['Resutls'] == 'pass')]
print (df)
  Resutls  part    g
0     IIL  None  IIL
1    pass     1  IIL
2    pass     2  IIL
3     IIH  None  IIH
4    pass     5  IIH
6    pass     4  IIL

dfs = {k:v.drop('g', axis=1) for k, v in df.groupby('g')}
#print (dfs)

print (dfs['IIH'])
  Resutls  part
3     IIH  None
4    pass     5

print (dfs['IIL'])
  Resutls  part
0     IIL  None
1    pass     1
2    pass     2
6    pass     4

Upvotes: 1

Related Questions