Reputation: 4651
I have the following function where df is a pandas dataframe that is 159538 rows x 3 columns:
dfs = []
for i in df['email_address']:
    data = df[df['email_address'] == i]
    data['difference'] = data['ts_placed'].diff().astype('timedelta64[D]')
    repeat = []
    for a in data['difference']:
        if a > 10:
            repeat.append(0)
        elif a <= 10:
            repeat.append(1)
        else:
            repeat.append(0)
    data['repeat'] = repeat
    dfs.append(data)
The function runs extremely slowly. I would like to speed up the process by using multiprocessing. This SO question shows how to do this in R. What is the equivalent code for Python?
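A minimal sketch of what I have in mind, splitting the frame by address and farming the groups out to a Pool (flag_repeats is just a placeholder name, and the 10-day threshold mirrors the loop above):

import pandas as pd
from multiprocessing import Pool

def flag_repeats(data):
    # data is the sub-frame for a single email_address
    data = data.copy()
    data['difference'] = data['ts_placed'].diff()
    # 1 if the order is within 10 days of the previous one, else 0 (NaT -> 0)
    data['repeat'] = (data['difference'].dt.days <= 10).astype(int)
    return data

if __name__ == '__main__':
    # one sub-frame per email_address
    groups = [group for _, group in df.groupby('email_address')]
    with Pool() as pool:
        dfs = pool.map(flag_repeats, groups)
    result = pd.concat(dfs)

I am not sure whether the pickling overhead of shipping each group to a worker process would eat the gains, which is why I am asking.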
This is a sample of the data after running:
df['difference'] = df.groupby('email_address')['ts_placed'].diff()
df
Out[6]:
email_address ts_placed difference
0 [email protected] 2015-08-06 00:00:34 NaT
1 [email protected] 2015-08-06 00:05:38 NaT
2 [email protected] 2015-08-06 00:09:20 NaT
3 [email protected] 2015-08-06 00:10:01 NaT
4 terry.wfdfdfdfdfy-holdings.co.uk 2015-08-06 00:14:00 NaT
5 [email protected] 2015-08-06 00:14:00 NaT
6 [email protected] 2015-08-06 00:14:00 NaT
7 [email protected] 2015-08-06 00:14:20 NaT
8 [email protected] 2015-08-06 00:14:43 NaT
9 [email protected] 2015-08-06 00:17:03 NaT
10 [email protected] 2015-08-06 00:17:58 NaT
...
22 [email protected] 2015-08-06 00:46:12 0 days 00:04:15
Upvotes: 1
Views: 147
Reputation: 393933
IIUC then you can do the following:
df['difference'] = df.groupby('email_address')['ts_placed'].diff()
df['repeat'] = (df['difference'].dt.days <= 10).astype(int)
The second line replaces your inner loop: rows placed within 10 days of that address's previous order get 1, and everything else (including each address's first order, where the diff is NaT) gets 0.
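A quick check of the same idea on a made-up two-address frame (the addresses and timestamps below are invented purely for illustration):

import pandas as pd

toy = pd.DataFrame({
    'email_address': ['a@example.com', 'a@example.com', 'b@example.com'],
    'ts_placed': pd.to_datetime(['2015-08-01', '2015-08-20', '2015-08-02']),
})
toy['difference'] = toy.groupby('email_address')['ts_placed'].diff()
toy['repeat'] = (toy['difference'].dt.days <= 10).astype(int)
# first order per address -> difference is NaT -> repeat 0
# the second a@example.com order is 19 days later -> repeat 0
print(toy)

This keeps everything vectorised in pandas, so it should be much faster than the Python-level loop without needing multiprocessing at all.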
Upvotes: 1