Peeyush

Reputation: 6464

Merge rows with same id and time in pandas

I have a dataframe of the format

id  time  a  b
01    01  1  0
01    01  0  1
01    01  0  0
01    02  0  0
01    02  1  0
01    03  1  0
01    03  0  1

So now, the output should be

id  time  a  b
01    01  1  1
01    02  1  0
01    03  1  1

Here, I am basically merging all rows with the same id and time, such that the values in the other columns are the max of all the values for that id and time.

I am currently doing

df = df.groupby(['id','time']).max()

However, this is taking a lot of time (>10 mins) as the number of rows and columns is large. I am wondering if there is a more efficient way to do the same thing!
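For reference, a self-contained version of the current approach on the sample data above (a sketch; the dtypes are assumed, and id/time are kept as strings to preserve the leading zeros):

import pandas as pd

df = pd.DataFrame({
    'id':   ['01'] * 7,
    'time': ['01', '01', '01', '02', '02', '03', '03'],
    'a':    [1, 0, 0, 0, 1, 1, 0],
    'b':    [0, 1, 0, 0, 0, 0, 1],
})

# Collapse duplicate (id, time) pairs, keeping the column-wise maximum.
out = df.groupby(['id', 'time']).max()
print(out.reset_index())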

Upvotes: 1

Views: 1657

Answers (1)

Stefan

Reputation: 42885

You may want to take advantage of multiprocessing if your hardware allows you to work on several cores simultaneously, and parallelize the calculation of the max by group:

Using a random sample with 25 columns and ['id', 'time'] as a MultiIndex:

import numpy as np
import pandas as pd

cols = {'id': np.random.randint(1, 11, 10000), 'time': np.random.randint(1, 11, 10000)}
cols.update({k: np.random.random(size=10000) for k in range(25)})
df = pd.DataFrame(cols).set_index(['id', 'time'])

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 10000 entries, (4, 9) to (3, 4)
Data columns (total 25 columns):
0     10000 non-null float64
1     10000 non-null float64
2     10000 non-null float64
3     10000 non-null float64
4     10000 non-null float64
....
20    10000 non-null float64
21    10000 non-null float64
22    10000 non-null float64
23    10000 non-null float64
24    10000 non-null float64
dtypes: float64(25)
memory usage: 2.0+ MB

Next, grouping by the ['id', 'time'] levels and capturing the group keys for later use as the result index:

grps = df.groupby(level=['id', 'time'])
index = [grp[0] for grp in grps]

Resulting in 100 groups:

print('# Groups: ', len(index))

# Groups:  100

Finally, setting up a Pool with 8 workers (matching the number of cores), and running the 100 groups through the pool, using functools.partial to pass along the axis=0 parameter:

from multiprocessing import Pool
from functools import partial

with Pool(processes=8) as pool:
    # np.amax with axis=0 yields the column-wise max of each group's DataFrame.
    imap_res = pool.imap(partial(np.amax, axis=0), [grp[1] for grp in grps])

Concatenating the results back into a DataFrame via a list comprehension (still inside the with block, since imap yields results lazily):

    result = pd.concat([pd.Series(res) for res in imap_res], axis=1).T.sort_index(axis=1)
    result.index = index

<class 'pandas.core.frame.DataFrame'>
Index: 100 entries, (1, 1) to (10, 10)
Data columns (total 25 columns):
0     100 non-null float64
1     100 non-null float64
2     100 non-null float64
3     100 non-null float64
4     100 non-null float64
....
20    100 non-null float64
21    100 non-null float64
22    100 non-null float64
23    100 non-null float64
24    100 non-null float64
dtypes: float64(25)
memory usage: 20.3+ KB
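As a quick sanity check, the parallel result should agree with the plain single-process groupby (a hedged sketch, assuming the sorted 0–24 column order matches the original):

expected = df.groupby(level=['id', 'time']).max()
# Compare values after aligning row and column order.
assert np.allclose(result.sort_index().values, expected.sort_index().values)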

Upvotes: 1
