Reputation: 11
I have a DataFrame with around 300 million rows and two columns. I need to sort the second column, while keeping together all rows that have the same value in the first column, but the order of the first column doesn't matter.
If I started with:
|ID |Date
0|DEF |2000-01-01
1|ABC |2000-01-01
2|DEF |2000-02-01
3|ABC |2000-02-01
I need to end up with either:
|ID |Date
0|DEF |2000-01-01
2|DEF |2000-02-01
1|ABC |2000-01-01
3|ABC |2000-02-01
or
|ID |Date
1|ABC |2000-01-01
3|ABC |2000-02-01
0|DEF |2000-01-01
2|DEF |2000-02-01
But it doesn't matter if I get the first or the second. Running
df.sort_values(by=["ID", "Date"])
on the full set of data takes about 8 minutes, but a part of that time is spent making sure the ID column is in the right order, which doesn't matter to me. The only way I could think to sort Date within groups of ID without sorting ID is
df.groupby("ID", sort=False).apply(lambda x: x.sort_values(by=["Date"])
But the groupby code takes significantly longer to run than just sorting the ID column. On a smaller sample of 10,000,000 rows the groupby sort takes over 10 minutes compared to 17.6 seconds for the two-column sort.
Is there a clever way to sort the second column within groups of the first that would be more efficient than just sorting both columns, or is the two-column sort the fastest I'm going to get?
Upvotes: 1
Views: 578
Reputation: 9
Disclaimer: I work for bodo.ai
I tried your code with Bodo, and here's how it worked out for me.
Generated some data first and stored it in a parquet file (I tested with two dataframe sizes, 1m and 100m rows):
import pandas as pd
import numpy as np
def gen_data(n):
df = pd.DataFrame(
{
"Date" : np.repeat(pd.date_range("2013-01-03", periods=1000), n),
"id" : np.random.choice(['ABC','DEF','GHI'], n * 1000)
}
)
df.to_parquet("pd_example"+str(n)+".pq", row_group_size=100000)
gen_data(100000)
Then I added a Bodo JIT decorator to your code, printed out the head and tail to see what that looks like and added some timers to see how much time Bodo takes.
import bodo
import time
@bodo.jit
def get_output(input_file):
# add timer here to start timing after compilation
t1 = time.time()
df = pd.read_parquet(input_file)
# your original code
df1 = df.sort_values(by=["id", "Date"])
print("time taken:",time.time()-t1)
print(df1.head(10), df1.tail(10))
return df1
Bodo uses MPI to parallelize Pandas code, so I was also able to run this code on 4 cores. Here are my results on a 2020 Macbook Pro:
Rows | Cores | Time Taken |
---|---|---|
1m | 1 | 3.22s |
1m | 4 | 0.97s |
10m | 1 | 41s |
10m | 4 | 12.5s |
Bodo is available as a conda install and free to use on up to 4 cores.
Upvotes: -1