Xer
Xer

Reputation: 503

Sort rows of DataFrame by duplicate

How can I sort a DataFrame so that rows in the duplicate column are "recycled"?

For example, my original DataFrame looks like this:

In [3]: df
Out[3]: 
    A  B
0  r1  0
1  r1  1
2  r2  2
3  r2  3
4  r3  4
5  r3  5

I would like it to turn to:

In [3]: df_sorted
Out[3]: 
    A  B
0  r1  0
2  r2  2
4  r3  4
1  r1  1
3  r2  3
5  r3  5

Rows are sorted such that rows in columns A are in a "recycled" fashion.

I have searched APIs in Pandas, but it seems there isn't any proper method to do so. I can write a complicated function to accomplish this, but just wondering is there any smart way or existing pandas method can do this? Thanks a lot in advance.

Update: Apologies for a wrong statement. In my real problem, column B contains string values.

Upvotes: 7

Views: 14340

Answers (2)

Zeugma
Zeugma

Reputation: 32095

You can formulate your need as taking the first row of each group, then the second line, then the thrid, etc. So this is equivalent to group your results by row per group and then per 'A' .

You can number your rows per key in 'A' with the function rank. Apply this function on each group and you are done:

df['C'] = df.groupby('A')['B'].rank()

df
Out[8]: 
    A  B    C
0  r1  0  1.0
1  r1  1  2.0
2  r2  2  1.0
3  r2  3  2.0
4  r3  4  1.0
5  r3  5  2.0

df.sort_values(['C', 'A'])
Out[9]: 
    A  B    C
0  r1  0  1.0
2  r2  2  1.0
4  r3  4  1.0
1  r1  1  2.0
3  r2  3  2.0
5  r3  5  2.0

You drop 'C' if you don't need it.


Edit to follow up on a comment

I take for granted in your sample 'B' is your index column. If it's not then you need to work on the index itself:

df['C'] = df.reset_index().groupby('A')['index'].rank()

Upvotes: 3

jezrael
jezrael

Reputation: 862661

You can use cumcount for counting duplicates in column A, then sort_values first by A (in sample not necessary, in real data maybe important) and then by C. Last remove column C by drop:

df['C'] = df.groupby('A')['A'].cumcount()
df.sort_values(by=['C', 'A'], inplace=True)
print (df)
    A  B  C
0  r1  0  0
2  r2  2  0
4  r3  4  0
1  r1  1  1
3  r2  3  1
5  r3  5  1

df.drop('C', axis=1, inplace=True)
print (df)
    A  B
0  r1  0
2  r2  2
4  r3  4
1  r1  1
3  r2  3
5  r3  5

Timings:

Small df (len(df)=6)

In [26]: %timeit (jez(df))
1000 loops, best of 3: 2 ms per loop

In [27]: %timeit (boud(df1))
100 loops, best of 3: 2.52 ms per loop

Large df (len(df)=6000)

In [23]: %timeit (jez(df))
100 loops, best of 3: 3.44 ms per loop

In [28]: %timeit (boud(df1))
100 loops, best of 3: 2.52 ms per loop

Code for timing:

df = pd.concat([df]*1000).reset_index(drop=True) 
df1 = df.copy()

def jez(df):
    df['C'] = df.groupby('A')['A'].cumcount()
    df.sort_values(by=['C', 'A'], inplace=True)
    df.drop('C', axis=1, inplace=True)
    return (df)

def boud(df):
    df['C'] = df.groupby('A')['B'].rank()
    df = df.sort_values(['C', 'A'])
    df.drop('C', axis=1, inplace=True)
    return (df)
100 loops, best of 3: 4.29 ms per loop

Upvotes: 8

Related Questions