Pandas most efficient way to do a multicolumn sort if the order of the first column doesn't matter

Question

I have a DataFrame with around 300 million rows and two columns. I need to sort by the second column within each group of rows that share the same value in the first column. Rows with the same first-column value must stay together, but the order of the groups themselves doesn't matter.

If I started with:

 |ID  |Date
0|DEF |2000-01-01
1|ABC |2000-01-01
2|DEF |2000-02-01
3|ABC |2000-02-01
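
For anyone who wants to reproduce this, the small frame above can be built roughly as follows. This is a minimal sketch that assumes Date is stored as a datetime column; the real data may well hold it as strings.

```python
import pandas as pd

# Sample frame from the question. Storing Date as datetime64 is an
# assumption; the real column might be strings.
df = pd.DataFrame(
    {
        "ID": ["DEF", "ABC", "DEF", "ABC"],
        "Date": pd.to_datetime(
            ["2000-01-01", "2000-01-01", "2000-02-01", "2000-02-01"]
        ),
    }
)
print(df)
```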

I need to end up with either:

 |ID  |Date
0|DEF |2000-01-01
2|DEF |2000-02-01
1|ABC |2000-01-01
3|ABC |2000-02-01

or

 |ID  |Date
1|ABC |2000-01-01
3|ABC |2000-02-01
0|DEF |2000-01-01
2|DEF |2000-02-01

But it doesn't matter if I get the first or the second. Running

df.sort_values(by=["ID", "Date"])

on the full set of data takes about 8 minutes, but a part of that time is spent making sure the ID column is in the right order, which doesn't matter to me. The only way I could think to sort Date within groups of ID without sorting ID is

df.groupby("ID", sort=False).apply(lambda x: x.sort_values(by=["Date"]))

But the groupby code takes significantly longer to run than the plain two-column sort. On a smaller sample of 10,000,000 rows the groupby sort takes over 10 minutes compared to 17.6 seconds for the two-column sort.
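
The comparison can be reproduced on synthetic data with something like the harness below. It is only a sketch: the row count, the number of distinct IDs, and the ID format are all assumptions chosen to keep it quick, so the timings it prints will not match the figures above.

```python
import time

import numpy as np
import pandas as pd

# Synthetic stand-in for the real data. Sizes and ID format are assumptions.
rng = np.random.default_rng(0)
n_rows, n_ids = 20_000, 500
sample = pd.DataFrame(
    {
        "ID": rng.integers(0, n_ids, size=n_rows).astype(str),
        "Date": pd.Timestamp("2000-01-01")
        + pd.to_timedelta(rng.integers(0, 3650, size=n_rows), unit="D"),
    }
)

# Approach 1: sort on both columns.
start = time.perf_counter()
by_sort = sample.sort_values(by=["ID", "Date"])
sort_seconds = time.perf_counter() - start

# Approach 2: sort Date inside each ID group, leaving group order alone.
start = time.perf_counter()
by_group = sample.groupby("ID", sort=False).apply(
    lambda x: x.sort_values(by=["Date"])
)
group_seconds = time.perf_counter() - start

print(f"two-column sort: {sort_seconds:.3f}s, groupby sort: {group_seconds:.3f}s")
```

Increasing n_rows and n_ids toward the real sizes should show how the gap grows with the number of groups.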

Is there a clever way to sort the second column within groups of the first that would be more efficient than just sorting both columns, or is the two-column sort the fastest I'm going to get?
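
One idea that might be worth benchmarking is sketched below: replace each ID with an integer code using pd.factorize (hash-based, numbered by first appearance), then do a single np.lexsort on the integer code and the date. The helper name sort_within_groups is hypothetical, Date is assumed to be a datetime column, and whether this actually beats sort_values on 300 million rows is untested here.

```python
import numpy as np
import pandas as pd


def sort_within_groups(frame, group_col="ID", sort_col="Date"):
    """Hypothetical helper: sort sort_col inside each block of group_col."""
    # factorize is hash-based and numbers the IDs in order of first
    # appearance, so the ID values themselves are never compared or sorted.
    codes, _ = pd.factorize(frame[group_col])
    # np.lexsort treats the LAST key as the primary one: group code first,
    # then sort_col within each code.
    order = np.lexsort((frame[sort_col].to_numpy(), codes))
    return frame.iloc[order]


# Sample frame from the question (Date as datetime64 is an assumption).
df = pd.DataFrame(
    {
        "ID": ["DEF", "ABC", "DEF", "ABC"],
        "Date": pd.to_datetime(
            ["2000-01-01", "2000-01-01", "2000-02-01", "2000-02-01"]
        ),
    }
)
result = sort_within_groups(df)
print(result)
```

On the sample this keeps the DEF rows first because DEF appears first in the input, which is one of the two acceptable orderings. Note that factorize gives missing IDs the code -1, so rows with a missing ID would end up grouped together at the front.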

Rows	Cores	Time Taken
1m	1	3.22s
1m	4	0.97s
10m	1	41s
10m	4	12.5s
