Reputation: 347
I am writing a process which takes a semi-large file as input (~4 million rows, 5 columns)
and performs a few operations on it.
Columns:
- CARD_NO
- ID
- CREATED_DATE
- STATUS
- FLAG2
I need to create a file which contains 1 copy of each CARD_NO where STATUS = '1' and CREATED_DATE is the maximum of all CREATED_DATEs for that CARD_NO.
I succeeded but my solution is very slow (3h and counting as of right now.)
Here is my code:
file = 'input.csv'
input = pd.read_csv(file)
input = input.drop_duplicates()
card_groups = input.groupby('CARD_NO', as_index=False, sort=False).filter(lambda x: x['STATUS'] == 1)
def important(x):
latest_date = x['CREATED_DATE'].values[x['CREATED_DATE'].values.argmax()]
return x[x.CREATED_DATE == latest_date]
#where the major slowdown occurs
group_2 = card_groups.groupby('CARD_NO', as_index=False, sort=False).apply(important)
path = 'result.csv'
group_2.to_csv(path, sep=',', index=False)
# ~4 minutes for the 154k rows file
# 3+ hours for ~4m rows
I was wondering if you had any advice on how to improve the running time of this little process.
Thank you and have a good day.
Upvotes: 0
Views: 173
Reputation: 128948
Setup (FYI make sure that your use parse_dates=True
when reading your csv)
In [6]: n_groups = 10000
In [7]: N = 4000000
In [8]: dates = date_range('20130101',periods=100)
In [9]: df = DataFrame(dict(id = np.random.randint(0,n_groups,size=N), status = np.random.randint(0,10,size=N), date=np.random.choice(dates,size=N,replace=True)))
In [10]: pd.set_option('max_rows',10)
In [13]: df = DataFrame(dict(card_no = np.random.randint(0,n_groups,size=N), status = np.random.randint(0,10,size=N), date=np.random.choice(dates,size=N,replace=True)))
In [14]: df
Out[14]:
card_no date status
0 5790 2013-02-11 6
1 6572 2013-03-17 6
2 7764 2013-02-06 3
3 4905 2013-04-01 3
4 3871 2013-04-08 1
... ... ... ...
3999995 1891 2013-02-16 5
3999996 9048 2013-01-11 9
3999997 1443 2013-02-23 1
3999998 2845 2013-01-28 0
3999999 5645 2013-02-05 8
[4000000 rows x 3 columns]
In [15]: df.dtypes
Out[15]:
card_no int64
date datetime64[ns]
status int64
dtype: object
Only status == 1, groupby card_no, then return the max date for that group
In [18]: df[df.status==1].groupby('card_no')['date'].max()
Out[18]:
card_no
0 2013-04-06
1 2013-03-30
2 2013-04-09
...
9997 2013-04-07
9998 2013-04-07
9999 2013-04-09
Name: date, Length: 10000, dtype: datetime64[ns]
In [19]: %timeit df[df.status==1].groupby('card_no')['date'].max()
1 loops, best of 3: 934 ms per loop
If you need a transform of this (e.g. the same values for each group. Note that with < 0.14.1 (releasing this week) you will need to use this soln here, otherwise this will be pretty slow)
In [20]: df[df.status==1].groupby('card_no')['date'].transform('max')
Out[20]:
4 2013-04-10
13 2013-04-10
25 2013-04-10
...
3999973 2013-04-10
3999979 2013-04-10
3999997 2013-04-09
Name: date, Length: 399724, dtype: datetime64[ns]
In [21]: %timeit df[df.status==1].groupby('card_no')['date'].transform('max')
1 loops, best of 3: 1.8 s per loop
I suspect you prob want to merge the final transform back into the original frame
In [24]: df.join(res.to_frame('max_date'))
Out[24]:
card_no date status max_date
0 5790 2013-02-11 6 NaT
1 6572 2013-03-17 6 NaT
2 7764 2013-02-06 3 NaT
3 4905 2013-04-01 3 NaT
4 3871 2013-04-08 1 2013-04-10
... ... ... ... ...
3999995 1891 2013-02-16 5 NaT
3999996 9048 2013-01-11 9 NaT
3999997 1443 2013-02-23 1 2013-04-09
3999998 2845 2013-01-28 0 NaT
3999999 5645 2013-02-05 8 NaT
[4000000 rows x 4 columns]
In [25]: %timeit df.join(res.to_frame('max_date'))
10 loops, best of 3: 58.8 ms per loop
The csv writing will actually take a fair amount of time relative to this. I used HDF5 for things like this, MUCH faster.
Upvotes: 3