Reputation: 6464
I have a dataframe of the following format:
id time a b
01 01 1 0
01 01 0 1
01 01 0 0
01 02 0 0
01 02 1 0
01 03 1 0
01 03 0 1
So now, the output should be
id time a b
01 01 1 1
01 02 1 0
01 03 1 1
Here, I am basically merging all rows with the same id and time, such that the values in the other columns are the max of all the values for that id and time.
I am currently doing
df = df.groupby(['id','time']).max()
However, this is taking a lot of time (>10 mins) as the number of rows and columns is large. I am wondering if there is a more efficient way to do the same thing!
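For reference, here is a minimal runnable version of the example above (using as_index=False so that id and time stay as columns, matching the desired output):

import pandas as pd

# Minimal reproduction of the sample data above
df = pd.DataFrame({
    'id':   ['01'] * 7,
    'time': ['01', '01', '01', '02', '02', '03', '03'],
    'a':    [1, 0, 0, 0, 1, 1, 0],
    'b':    [0, 1, 0, 0, 0, 0, 1],
})

# Take the column-wise max within each (id, time) group
out = df.groupby(['id', 'time'], as_index=False).max()
print(out)
#    id time  a  b
# 0  01   01  1  1
# 1  01   02  1  0
# 2  01   03  1  1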
Upvotes: 1
Views: 1657
Reputation: 42885
You may want to take advantage of multiprocessing if your hardware allows you to work on several cores simultaneously, and parallelize the calculation of the max by group.
Using a random sample with 25 columns and ['id', 'time'] as the MultiIndex:
import numpy as np
import pandas as pd

cols = {'id': np.random.randint(1, 11, 10000), 'time': np.random.randint(1, 11, 10000)}
cols.update({k: np.random.random(size=10000) for k in range(25)})
df = pd.DataFrame(cols).set_index(['id', 'time'])
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 10000 entries, (4, 9) to (3, 4)
Data columns (total 25 columns):
0 10000 non-null float64
1 10000 non-null float64
2 10000 non-null float64
3 10000 non-null float64
4 10000 non-null float64
....
20 10000 non-null float64
21 10000 non-null float64
22 10000 non-null float64
23 10000 non-null float64
24 10000 non-null float64
dtypes: float64(25)
memory usage: 2.0+ MB
Next, grouping by the ['id', 'time'] index levels and capturing the group keys:
grps = df.groupby(level=['id', 'time'])
index = [grp[0] for grp in grps]
Resulting in 100 groups:
print('# Groups:', len(index))
# Groups: 100
Finally, setting up a Pool with 8 workers (one per core) and running the 100 groups through the pool via functools.partial to pass along the axis=0 parameter to np.amax:
from multiprocessing import Pool
from functools import partial
with Pool(processes=8) as pool:
    # materialize the lazy imap results while the pool is still alive
    imap_res = list(pool.imap(partial(np.amax, axis=0), [grp[1] for grp in grps]))
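If the worker count shouldn't be hard-coded, cpu_count() can size the pool to the machine, and imap's chunksize parameter ships several groups to a worker per round trip to cut IPC overhead. A variant of the call above (not an additional step):

from multiprocessing import Pool, cpu_count
from functools import partial
import numpy as np

# Same computation, sized to the available (logical) cores; chunksize
# batches groups so fewer pickling round trips are needed.
with Pool(processes=cpu_count()) as pool:
    imap_res = list(pool.imap(partial(np.amax, axis=0),
                              [grp[1] for grp in grps],
                              chunksize=10))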
Concatenating the results back into a DataFrame via a list comprehension:
result = pd.concat([pd.Series(res) for res in imap_res], axis=1).T.sort_index(axis=1)
result.index = index  # re-attach the (id, time) group keys
<class 'pandas.core.frame.DataFrame'>
Index: 100 entries, (1, 1) to (10, 10)
Data columns (total 25 columns):
0 100 non-null float64
1 100 non-null float64
2 100 non-null float64
3 100 non-null float64
4 100 non-null float64
....
20 100 non-null float64
21 100 non-null float64
22 100 non-null float64
23 100 non-null float64
24 100 non-null float64
dtypes: float64(25)
memory usage: 20.3+ KB
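As a sanity check, the parallel result can be compared against the single-process baseline. This sketch assumes both keep the sorted group-key order, which holds here because groupby sorts keys by default and imap preserves input order:

# Column-wise max per group, computed in a single process
baseline = df.groupby(level=['id', 'time']).max()
# Values should match row for row and column for column
assert np.allclose(result.values, baseline.values)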
Upvotes: 1