user2949037

Reputation: 13

Selecting values from pandas data frame using multiple conditions

I have the following dataframe in pandas. The Score and Date_of_interest columns are the ones to be calculated; below they are already filled in to make the explanation of the problem easier.

First, assume that the Score and Date_of_interest columns are filled with NaNs only. The steps to fill in their values are as follows.

a) We are trying to get one date of interest per PC_id, based on the criteria described below, e.g. PC_id 200 yields 1998-04-10 02:25:00, and so on.

b) Within each PC_id we walk the rows and look for a change in Item_id; each new Item_id scores 1. When the Item_id stays the same, as in the first and second rows (both 1), the score starts at 1 but does not change in the second row.

c) While calculating the score for the second row we also check the Datetime difference: if the previous row is more than 24 hours older, it is dropped, the score resets to 1, and the cursor moves on to the third row.

d) When the Score reaches 2 we have reached the qualifying score, as in row 5 (index 4), and we copy the corresponding Datetime into the Date_of_interest column.

e) We start a new cycle for each new PC_id, as in row 6 (index 5).
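Read literally, the steps above can be sketched as a per-group scan (a minimal sketch, not a final solution; `first_qualifying_date` is a hypothetical helper name, and the 24-hour window and threshold of 2 are taken from the rules):

```python
import pandas as pd

def first_qualifying_date(group, threshold=2, window=pd.Timedelta(days=1)):
    """Scan one PC_id group in Datetime order and return the first
    Datetime where the running score reaches `threshold`, or pd.NaT."""
    score = 0
    prev_item = prev_time = None
    for _, row in group.iterrows():
        # step c: reset if the previous row is more than 24 hours older
        if prev_time is not None and row['Datetime'] - prev_time > window:
            score = 0
        # step b: a change in Item_id scores a point
        if row['Item_id'] != prev_item:
            score += 1
        # step d: the qualifying row's Datetime is the date of interest
        if score >= threshold:
            return row['Datetime']
        prev_item, prev_time = row['Item_id'], row['Datetime']
    return pd.NaT
```

Applied per group with `df.groupby('PC_id').apply(first_qualifying_date)`, this should reproduce the Date_of_interest values in the table.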

   Datetime        Item_id     PC_id       Value     Score    Date_of_interest

0   1998-04-08 01:00:00  1      200          35         1       NaN
1   1998-04-08 02:00:00  1      200          92         1       NaN
2   1998-04-10 02:00:00  2      200          35         1       NaN
3   1998-04-10 02:15:00  2      200          92         1       NaN
4   1998-04-10 02:25:00  3      200          92         2     1998-04-10 02:25:00

5   1998-04-10 03:00:00  1      201          93         1       NaN
6   1998-04-12 03:30:00  3      201          94         1       NaN
7   1998-04-12 04:00:00  4      201          95         2       NaN
8   1998-04-12 04:00:00  4      201          26         2     1998-04-12 04:00:00
9   1998-04-12 04:30:00  2      201          98         3       NaN

10  1998-04-12 04:50:00  1      202         100         1       NaN
11  1998-04-15 05:00:00  4      202         100         1       NaN
12  1998-04-15 05:15:00  3      202         100         2   1998-04-15 05:15:00
13  1998-04-15 05:30:00  2      202         100         3       NaN
14  1998-04-15 06:00:00  3      202         100         NaN     NaN
15  1998-04-15 06:00:00  3      202         222         NaN     NaN

Final table should be as follows:

    PC_id      Date_of_interest  

0   200       1998-04-10 02:25:00
1   201       1998-04-12 04:00:00
2   202       1998-04-15 05:15:00

Thanks for helping.

Update: the code I am currently working on:

# Note: writes to `row` inside iterrows() are lost; assignments must go
# through df_merged.loc. The 24-hour check uses pd.Timedelta
# (pd.datetime.timedelta does not exist).
for elem in df_merged['PC_id'].unique():
    score = 0
    first_item = None
    first_date = None
    for i, row in df_merged[df_merged['PC_id'] == elem].iterrows():
        # reset the score when the previous row is more than 24 hours older
        if first_date is not None and row['Datetime'] - first_date > pd.Timedelta(days=1):
            score = 0
        # a change in Item_id scores a point
        if row['Item_id'] != first_item:
            score += 1
        df_merged.loc[i, 'Score'] = score
        if score >= 2:
            df_merged.loc[i, 'Date_of_interest'] = row['Datetime']
            break
        first_item = row['Item_id']
        first_date = row['Datetime']

Upvotes: 1

Views: 184

Answers (1)

Ilja Everilä

Reputation: 52929

Usually having to resort to iterative/imperative methods is a sign of trouble when working with pandas. Given the dataframe

In [111]: df2
Out[111]: 
              Datetime  Item_id  PC_id  Value
0  1998-04-08 01:00:00        1    200     35
1  1998-04-08 02:00:00        1    200     92
2  1998-04-10 02:00:00        2    200     35
3  1998-04-10 02:15:00        2    200     92
4  1998-04-10 02:25:00        3    200     92
5  1998-04-10 03:00:00        1    201     93
6  1998-04-12 03:30:00        3    201     94
7  1998-04-12 04:00:00        4    201     95
8  1998-04-12 04:00:00        4    201     26
9  1998-04-12 04:30:00        2    201     98
10 1998-04-12 04:50:00        1    202    100
11 1998-04-15 05:00:00        4    202    100
12 1998-04-15 05:15:00        3    202    100
13 1998-04-15 05:30:00        2    202    100
14 1998-04-15 06:00:00        3    202    100
15 1998-04-15 06:00:00        3    202    222

you could first group by PC_id

In [112]: the_group = df2.groupby('PC_id')

and then apply the search using diff() to flag the rows where Item_id changes and the Datetime gap to the previous row is within 24 hours

In [357]: (the_group['Item_id'].diff() != 0) & \
     ...: (the_group['Datetime'].diff() <= timedelta(days=1))
Out[357]: 
0     False
1     False
2     False
3     False
4      True
5     False
6     False
7      True
8     False
9      True
10    False
11    False
12     True
13     True
14     True
15    False
dtype: bool

and then just take the first date (first match) in each group, if any

In [341]: df2[(the_group['Item_id'].diff() != 0) &
     ...:     (the_group['Datetime'].diff() <= timedelta(days=1))]\
     ...: .groupby('PC_id').first()['Datetime'].reset_index()
Out[341]: 
   PC_id            Datetime
0    200 1998-04-10 02:25:00
1    201 1998-04-12 04:00:00
2    202 1998-04-15 05:15:00
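For reference, the answer's pipeline as a self-contained script (a sketch that rebuilds the example data inline; note that `timedelta` comes from the standard `datetime` module, and that the comparison requires the Datetime column to already be of datetime64 dtype, as `pd.to_datetime` ensures here):

```python
from datetime import timedelta

import pandas as pd

# Rebuild the example dataframe; pd.to_datetime gives datetime64 dtype.
df2 = pd.DataFrame({
    'Datetime': pd.to_datetime([
        '1998-04-08 01:00:00', '1998-04-08 02:00:00', '1998-04-10 02:00:00',
        '1998-04-10 02:15:00', '1998-04-10 02:25:00', '1998-04-10 03:00:00',
        '1998-04-12 03:30:00', '1998-04-12 04:00:00', '1998-04-12 04:00:00',
        '1998-04-12 04:30:00', '1998-04-12 04:50:00', '1998-04-15 05:00:00',
        '1998-04-15 05:15:00', '1998-04-15 05:30:00', '1998-04-15 06:00:00',
        '1998-04-15 06:00:00']),
    'Item_id': [1, 1, 2, 2, 3, 1, 3, 4, 4, 2, 1, 4, 3, 2, 3, 3],
    'PC_id': [200] * 5 + [201] * 5 + [202] * 6,
    'Value': [35, 92, 35, 92, 92, 93, 94, 95, 26, 98,
              100, 100, 100, 100, 100, 222],
})

the_group = df2.groupby('PC_id')
# True where Item_id changed and the previous row is at most 24 hours older;
# the first row of each group gets NaN/NaT diffs and is therefore excluded.
mask = (the_group['Item_id'].diff() != 0) & \
       (the_group['Datetime'].diff() <= timedelta(days=1))
result = df2[mask].groupby('PC_id').first()['Datetime'].reset_index()
print(result)
```

`pd.Timedelta(days=1)` would work equally well in place of `timedelta(days=1)`.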

Upvotes: 1
