Reputation: 13
I have the following dataframe in Pandas. Score and Date_of_interest columns are to be calculated. Below it is already filled out to make the explanation of the problem easy.
First let's assume that Score and Date_of_interest columns are filled with NaN's only. Below are the steps to fill the values in them.
a) We are trying to get one date of interest, based on the criteria described below for one PC_id eg. PC_id 200 has 1998-04-10 02:25:00 and so on.
b) To solve this problem we take the PC_id column and check each row to find the change in Item_id, each has a score of 1. For the same Item_id like in 1st row and second row, has 1 and 1 so the value starts with 1 but does not change in second row.
c) While moving and calculating the score for the second row it also checks the Datetime difference, if the previous one is more than 24 hours old, it is dropped and score is reset to 1 and cursor moves to third row.
d) When the Score reaches 2, we have reached the qualifying score as in row no 5(index 4) and we copy the corresponding Datetime in Date_of_interest column.
e) We start the new cycle for new PC_id as in row six.
Datetime Item_id PC_id Value Score Date_of_interest
0 1998-04-8 01:00:00 1 200 35 1 NaN
1 1998-04-8 02:00:00 1 200 92 1 NaN
2 1998-04-10 02:00:00 2 200 35 1 NaN
3 1998-04-10 02:15:00 2 200 92 1 NaN
4 1998-04-10 02:25:00 3 200 92 2 1998-04-10 02:25:00
5 1998-04-10 03:00:00 1 201 93 1 NaN
6 1998-04-12 03:30:00 3 201 94 1 NaN
7 1998-04-12 04:00:00 4 201 95 2 NaN
8 1998-04-12 04:00:00 4 201 26 2 1998-04-12 04:00:00
9 1998-04-12 04:30:00 2 201 98 3 NaN
10 1998-04-12 04:50:00 1 202 100 1 NaN
11 1998-04-15 05:00:00 4 202 100 1 NaN
12 1998-04-15 05:15:00 3 202 100 2 1998-04-15 05:15:00
13 1998-04-15 05:30:00 2 202 100 3 NaN
14 1998-04-15 06:00:00 3 202 100 NaN NaN
15 1998-04-15 06:00:00 3 202 222 NaN NaN
Final table should be as follows:
PC_id Date_of_interest
0 200 1998-04-10 02:25:00
1 201 1998-04-12 04:00:00
2 202 1998-04-15 05:15:00
Thanks for helping.
Update : Code I am working on currently:
df_merged_unique = df_merged['PC_id'].unique()
score = 0
for i, row in df_merged.iterrows():
for elem in df_merged_unique:
first_date = row['Datetime']
first_item = 0
if row['PC_id'] == elem:
if row['Score'] < 2:
if row['Item_id'] != first_item:
if row['Datetime']-first_date <= pd.datetime.timedelta(days=1):
score += 1
row['Score'] = score
first_date = row['Datetime']
else:
pass
else:
pass
else:
row['Date_of_interest'] = row['Datetime']
break
else:
pass
Upvotes: 1
Views: 184
Reputation: 52929
Usually having to resort to iterative/imperative methods is a sign of trouble when working with pandas
. Given the dataframe
In [111]: df2
Out[111]:
Datetime Item_id PC_id Value
0 1998-04-08 01:00:00 1 200 35
1 1998-04-08 02:00:00 1 200 92
2 1998-04-10 02:00:00 2 200 35
3 1998-04-10 02:15:00 2 200 92
4 1998-04-10 02:25:00 3 200 92
5 1998-04-10 03:00:00 1 201 93
6 1998-04-12 03:30:00 3 201 94
7 1998-04-12 04:00:00 4 201 95
8 1998-04-12 04:00:00 4 201 26
9 1998-04-12 04:30:00 2 201 98
10 1998-04-12 04:50:00 1 202 100
11 1998-04-15 05:00:00 4 202 100
12 1998-04-15 05:15:00 3 202 100
13 1998-04-15 05:30:00 2 202 100
14 1998-04-15 06:00:00 3 202 100
15 1998-04-15 06:00:00 3 202 222
you could first group by PC_id
In [112]: the_group = df2.groupby('PC_id')
and then apply the search using diff()
to get the rows where Item_id and Datetime change appropriately
In [357]: (the_group['Item_id'].diff() != 0) & \
...: (the_group['Datetime'].diff() <= timedelta(days=1))
Out[357]:
0 False
1 False
2 False
3 False
4 True
5 False
6 False
7 True
8 False
9 True
10 False
11 False
12 True
13 True
14 True
15 False
16 False
dtype: bool
and then just take the first date (first match) in each group, if any
In [341]: df2[(the_group['Item_id'].diff() != 0) &
...: (the_group['Datetime'].diff() <= timedelta(days=1))]\
...: .groupby('PC_id').first()['Datetime'].reset_index()
Out[341]:
PC_id Datetime
0 200 1998-04-10 02:25:00
1 201 1998-04-12 04:00:00
2 202 1998-04-15 05:15:00
Upvotes: 1