Reputation: 581
I have a large-ish pandas dataframe with multiple columns (c1 ... c8) and ~32 mil rows. The dataframe is already sorted by c1. I want to grab other column values from rows that share a particular value of c1.
Something like:
import numpy as np

keys = big_df['c1'].unique()
red = np.zeros(len(keys))
for i, key in enumerate(keys):
    inds = (big_df['c1'] == key)          # boolean mask: scans the full column for every key
    v1 = np.array(big_df.loc[inds, 'c2'])
    v2 = np.array(big_df.loc[inds, 'c6'])
    red[i] = reduce_fun(v1, v2)
However, this turns out to be very slow, I think because each iteration checks the entire column against the matching criterion (even though there might be only 10 relevant rows out of 32 million). Since big_df is sorted by c1 and keys is just the list of all unique values of c1, is there a fast way to build the red[] array? (That is, I know the first row with the next key is the row right after the last row of the previous key, and I know the last row for a key is the last row that matches it, since all subsequent rows are guaranteed not to match.)
Thanks,
Ilya
Edit: I am not sure what order the unique() method produces, but I basically want a reduce_fun() value for every key in keys; I don't particularly care what order they come in (presumably the easiest is the order c1 is already sorted in).
Edit2: I slightly restructured the code. Basically: is there an efficient way of constructing inds? According to the line profiler, big_df['c1'] == key takes 75.8% of the total time on my data, while creating v1 and v2 takes 21.6%.
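For concreteness, here is a rough sketch of the contiguous-block slicing I have in mind, assuming c1 is sorted ascending; np.searchsorted is just my guess at a vectorized way to find each key's block boundaries, and reduce_fun stands in for my actual reduction:

import numpy as np

# c1 is sorted, so every key occupies one contiguous block of rows;
# searchsorted finds each block's boundaries in O(log n) per key
# instead of scanning the whole column.
c1 = big_df['c1'].to_numpy()
keys = big_df['c1'].unique()                    # in sorted order, since c1 is sorted
starts = np.searchsorted(c1, keys, side='left')
stops = np.searchsorted(c1, keys, side='right')

c2 = big_df['c2'].to_numpy()
c6 = big_df['c6'].to_numpy()
red = np.zeros(len(keys))
for i, (lo, hi) in enumerate(zip(starts, stops)):
    red[i] = reduce_fun(c2[lo:hi], c6[lo:hi])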
Upvotes: 2
Views: 1200
Reputation: 4558
How about a groupby statement in a list comprehension? This should be especially efficient given the DataFrame is already sorted by c1:

Edit: Forgot that groupby returns a tuple. Oops!
red = [reduce_fun(g['c2'].values, g['c6'].values) for i, g in big_df.groupby('c1', sort=False)]
Seems to chug through pretty quickly for me (~2 seconds for 30 million random rows and a trivial reduce_fun).
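For reference, a self-contained toy version of that run (the DataFrame contents and reduce_fun below are made up purely for illustration):

import numpy as np
import pandas as pd

# Toy stand-in for big_df: 30 million rows, pre-sorted by c1.
n = 30_000_000
big_df = pd.DataFrame({
    'c1': np.sort(np.random.randint(0, 100_000, n)),
    'c2': np.random.rand(n),
    'c6': np.random.rand(n),
})

def reduce_fun(v1, v2):
    # trivial reduction, just for benchmarking
    return v1.mean() + v2.mean()

red = [reduce_fun(g['c2'].values, g['c6'].values)
       for _, g in big_df.groupby('c1', sort=False)]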
Upvotes: 2
Reputation: 109626
Rather than a list, I chose a dictionary to hold the reduced values, keyed on each item in c1:
red = {key: reduce_func(frame['c2'].values, frame['c7'].values)
for key, frame in df.groupby('c1')}
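A quick usage sketch (some_key is any value known to occur in c1; red is the dictionary built above):

# Direct lookup of the reduced value for a given c1 key.
some_key = df['c1'].iat[0]
result = red[some_key]

Unlike the list version, the association between each c1 value and its result is preserved regardless of group order.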
Upvotes: 6