Jon Hearn

Reputation: 11

Pandas most efficient way to do a multicolumn sort if the order of the first column doesn't matter

I have a DataFrame with around 300 million rows and two columns. I need to sort on the second column while keeping together all rows that share a value in the first column; the order of the first column itself doesn't matter.

If I started with:

 |ID  |Date
0|DEF |2000-01-01
1|ABC |2000-01-01
2|DEF |2000-02-01
3|ABC |2000-02-01

I need to end up with either:

 |ID  |Date
0|DEF |2000-01-01
2|DEF |2000-02-01
1|ABC |2000-01-01
3|ABC |2000-02-01

or

 |ID  |Date
1|ABC |2000-01-01
3|ABC |2000-02-01
0|DEF |2000-01-01
2|DEF |2000-02-01
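For illustration, the "kept together" requirement can be checked with a small helper; the name `groups_contiguous` and the implementation below are just a sketch of mine, not part of pandas:

```python
import pandas as pd

def groups_contiguous(ids: pd.Series) -> bool:
    """True if each distinct value in `ids` forms a single contiguous run."""
    # Count run boundaries (positions where the value changes, counting the
    # first row as a boundary) and compare with the number of distinct
    # values: a grouped column has exactly one run per distinct value.
    runs = (ids != ids.shift()).sum()
    return bool(runs == ids.nunique())

print(groups_contiguous(pd.Series(["DEF", "DEF", "ABC", "ABC"])))  # True
print(groups_contiguous(pd.Series(["DEF", "ABC", "DEF", "ABC"])))  # False
```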

But it doesn't matter if I get the first or the second. Running

df.sort_values(by=["ID", "Date"])

on the full set of data takes about 8 minutes, but part of that time is spent making sure the ID column is in the right order, which doesn't matter to me. The only way I could think of to sort Date within groups of ID without sorting ID is

df.groupby("ID", sort=False).apply(lambda x: x.sort_values(by=["Date"]))

But the groupby code takes significantly longer to run than just sorting the ID column. On a smaller sample of 10,000,000 rows the groupby sort takes over 10 minutes compared to 17.6 seconds for the two-column sort.

Is there a clever way to sort the second column within groups of the first that would be more efficient than just sorting both columns, or is the two-column sort the fastest I'm going to get?
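One option worth benchmarking (this is a sketch, not a measured result): replace the string IDs with the integer codes from `pd.factorize`, sort on the codes plus the date, then drop the codes. Integer comparisons are cheaper than string comparisons, and the arbitrary first-appearance order of the codes is fine here precisely because the group order doesn't matter. The `_code` column name is just a placeholder:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "ID": ["DEF", "ABC", "DEF", "ABC"],
        "Date": pd.to_datetime(
            ["2000-01-01", "2000-01-01", "2000-02-01", "2000-02-01"]
        ),
    }
)

# Integer codes in first-appearance order: DEF -> 0, ABC -> 1.
codes, _ = pd.factorize(df["ID"])

# Sort on (code, Date); the temporary column is dropped afterwards.
out = (
    df.assign(_code=codes)
    .sort_values(["_code", "Date"])
    .drop(columns="_code")
)

print(out["ID"].tolist())  # ['DEF', 'DEF', 'ABC', 'ABC']
```

Whether this beats the plain two-column sort on 300 million rows would need to be measured on the real data.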

Upvotes: 1

Views: 578

Answers (1)

CapitalDot

Reputation: 9

Disclaimer: I work for bodo.ai

I tried your code with Bodo, and here's how it worked out for me.

First I generated some data and stored it in a Parquet file (I tested with two DataFrame sizes, 1m and 100m rows):


import numpy as np
import pandas as pd


def gen_data(n):
    # n * 1000 rows: 1000 distinct dates, each repeated n times
    df = pd.DataFrame(
        {
            "Date": np.repeat(pd.date_range("2013-01-03", periods=1000), n),
            "id": np.random.choice(["ABC", "DEF", "GHI"], n * 1000),
        }
    )
    df.to_parquet("pd_example" + str(n) + ".pq", row_group_size=100000)


gen_data(100000)

Then I added a Bodo JIT decorator to your code, printed the head and tail to see what the output looks like, and added some timers to measure how long Bodo takes.


import time

import bodo
import pandas as pd


@bodo.jit
def get_output(input_file):
    # start timing inside the function so compilation time is excluded
    t1 = time.time()

    df = pd.read_parquet(input_file)
    # your original code
    df1 = df.sort_values(by=["id", "Date"])

    print("time taken:", time.time() - t1)
    print(df1.head(10), df1.tail(10))

    return df1

Bodo uses MPI to parallelize Pandas code, so I was also able to run this code on 4 cores. Here are my results on a 2020 MacBook Pro:

Rows |Cores |Time taken
1m   |1     |3.22s
1m   |4     |0.97s
10m  |1     |41s
10m  |4     |12.5s

Bodo is available as a conda install and free to use on up to 4 cores.

Upvotes: -1
