Reputation: 503
How can I sort a DataFrame so that rows with duplicate values in a column are "recycled"?
For example, my original DataFrame looks like this:
In [3]: df
Out[3]:
    A  B
0  r1  0
1  r1  1
2  r2  2
3  r2  3
4  r3  4
5  r3  5
I would like to turn it into:
In [3]: df_sorted
Out[3]:
    A  B
0  r1  0
2  r2  2
4  r3  4
1  r1  1
3  r2  3
5  r3  5
Rows are sorted such that the duplicate values in column A appear in a "recycled" fashion.
I have searched the pandas API, but it seems there isn't any method that does this directly. I could write a complicated function to accomplish it, but I am wondering whether there is a smarter way or an existing pandas method that can do this. Thanks a lot in advance.
Update:
Apologies for an incorrect statement: in my real problem, column B contains string values.
Upvotes: 7
Views: 14340
Reputation: 32095
You can formulate your need as taking the first row of each group, then the second row, then the third, etc. So this is equivalent to sorting the rows first by their row number within each group and then by 'A'.
You can number the rows per key in 'A' with the function rank. Apply this function on each group and you are done:
df['C'] = df.groupby('A')['B'].rank()
df
Out[8]:
    A  B    C
0  r1  0  1.0
1  r1  1  2.0
2  r2  2  1.0
3  r2  3  2.0
4  r3  4  1.0
5  r3  5  2.0
df.sort_values(['C', 'A'])
Out[9]:
    A  B    C
0  r1  0  1.0
2  r2  2  1.0
4  r3  4  1.0
1  r1  1  2.0
3  r2  3  2.0
5  r3  5  2.0
You can drop 'C' if you don't need it.
Edit to follow up on a comment:
I took for granted that in your sample 'B' is your index column. If it's not, then you need to work on the index itself:
df['C'] = df.reset_index().groupby('A')['index'].rank()
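A minimal end-to-end sketch of this approach, using the sample data from the question (column names as in the post):

```python
import pandas as pd

# Sample frame from the question
df = pd.DataFrame({'A': ['r1', 'r1', 'r2', 'r2', 'r3', 'r3'],
                   'B': [0, 1, 2, 3, 4, 5]})

# Rank rows within each 'A' group, sort by that rank (breaking ties by 'A'),
# then drop the helper column
df['C'] = df.groupby('A')['B'].rank()
df_sorted = df.sort_values(['C', 'A']).drop('C', axis=1)

print(df_sorted)
# Rows come out in the order 0, 2, 4, 1, 3, 5
```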
Upvotes: 3
Reputation: 862661
You can use cumcount to count the duplicates in column A, then sort_values first by C and then by A (the secondary sort by A is not necessary for the sample, but it may be important for real data). Finally, remove column C with drop:
df['C'] = df.groupby('A')['A'].cumcount()
df.sort_values(by=['C', 'A'], inplace=True)
print (df)
    A  B  C
0  r1  0  0
2  r2  2  0
4  r3  4  0
1  r1  1  1
3  r2  3  1
5  r3  5  1
df.drop('C', axis=1, inplace=True)
print (df)
    A  B
0  r1  0
2  r2  2
4  r3  4
1  r1  1
3  r2  3
5  r3  5
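Since the question's update says that column B contains strings in the real data, it is worth noting that cumcount only looks at column A, so B's dtype is irrelevant. A quick sketch with made-up string values in B:

```python
import pandas as pd

# Same shape as the question's sample, but with string values in B
# (the letters are illustrative, not from the original post)
df = pd.DataFrame({'A': ['r1', 'r1', 'r2', 'r2', 'r3', 'r3'],
                   'B': list('abcdef')})

# cumcount numbers the occurrences of each key in A: 0, 1, 0, 1, 0, 1
df['C'] = df.groupby('A')['A'].cumcount()
df_sorted = df.sort_values(['C', 'A']).drop('C', axis=1)

print(df_sorted['B'].tolist())
# ['a', 'c', 'e', 'b', 'd', 'f']
```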
Timings:
Small df (len(df)=6)
In [26]: %timeit (jez(df))
1000 loops, best of 3: 2 ms per loop
In [27]: %timeit (boud(df1))
100 loops, best of 3: 2.52 ms per loop
Large df (len(df)=6000)
In [23]: %timeit (jez(df))
100 loops, best of 3: 3.44 ms per loop
In [28]: %timeit (boud(df1))
100 loops, best of 3: 2.52 ms per loop
Code for timing:
df = pd.concat([df]*1000).reset_index(drop=True)
df1 = df.copy()
def jez(df):
    df['C'] = df.groupby('A')['A'].cumcount()
    df.sort_values(by=['C', 'A'], inplace=True)
    df.drop('C', axis=1, inplace=True)
    return df

def boud(df):
    df['C'] = df.groupby('A')['B'].rank()
    df = df.sort_values(['C', 'A'])
    df.drop('C', axis=1, inplace=True)
    return df
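Outside IPython, the same comparison can be sketched with the standard timeit module. One difference from the snippet above: the functions here work on a copy, so repeated calls see unchanged input (jez mutates its argument in place, which would skew a repeated benchmark):

```python
import timeit
import pandas as pd

df = pd.DataFrame({'A': ['r1', 'r1', 'r2', 'r2', 'r3', 'r3'],
                   'B': range(6)})
df = pd.concat([df] * 1000).reset_index(drop=True)

def jez(df):
    df = df.copy()  # avoid mutating the shared frame between runs
    df['C'] = df.groupby('A')['A'].cumcount()
    df.sort_values(by=['C', 'A'], inplace=True)
    df.drop('C', axis=1, inplace=True)
    return df

def boud(df):
    df = df.copy()
    df['C'] = df.groupby('A')['B'].rank()
    df = df.sort_values(['C', 'A'])
    df.drop('C', axis=1, inplace=True)
    return df

# Total seconds for 100 calls each; absolute numbers depend on the machine
print(timeit.timeit(lambda: jez(df), number=100))
print(timeit.timeit(lambda: boud(df), number=100))
```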
Upvotes: 8