Leon Kyriacou
Leon Kyriacou

Reputation: 402

Pandas Filtering Using Dictionary

I work in a kiosk company, we're looking to find if a UI update has made any difference. Each machine has had an update at different dates/times. I have built a dictionary of machine_ids and the timestamps at which the new UI was installed. I then want to use this to filter results, so only return rows where the machine_id is in the dictionary and deposit dates are greater than the corresponding date in the dict

uidict= {
 14.0: Timestamp('2018-10-12 17:48:57'),
 16.0: Timestamp('2018-10-12 13:38:00'),
 19.0: Timestamp('2018-10-17 20:17:33'),
 20.0: Timestamp('2018-10-15 12:15:34'),
 27.0: Timestamp('2018-09-26 11:50:01'),
 29.0: Timestamp('2018-10-03 13:38:17'),
 31.0: Timestamp('2018-10-17 10:06:23'),
 33.0: Timestamp('2018-09-21 15:17:14'),
 34.0: Timestamp('2018-10-17 11:42:21'),
 42.0: Timestamp('2018-10-16 12:36:32'),
 45.0: Timestamp('2018-09-23 13:23:37'),
 53.0: Timestamp('2018-09-27 12:18:39'),
 60.0: Timestamp('2018-10-15 15:27:46'),
 62.0: Timestamp('2018-08-30 17:26:27'),
 63.0: Timestamp('2018-09-25 17:44:04'),
 64.0: Timestamp('2018-09-23 14:19:57'),
 65.0: Timestamp('2018-08-31 19:07:47'),
 66.0: Timestamp('2018-09-08 14:12:20'),
 67.0: Timestamp('2018-09-11 08:18:31'),
 69.0: Timestamp('2018-09-20 17:12:37'),
 70.0: Timestamp('2018-09-24 12:56:45'),
 71.0: Timestamp('2018-08-27 09:37:17'),
 72.0: Timestamp('2018-09-05 19:07:34'),
 73.0: Timestamp('2018-09-10 14:42:52'),
 74.0: Timestamp('2018-09-25 16:36:05'),
 75.0: Timestamp('2018-08-27 10:09:02'),
 76.0: Timestamp('2018-09-13 07:20:40'),
 77.0: Timestamp('2018-09-02 14:10:22'),
 78.0: Timestamp('2018-09-26 15:06:51'),
 79.0: Timestamp('2018-08-31 15:52:49'),
 81.0: Timestamp('2018-10-05 10:05:11')}

I tried this filtering to make it work:

df[(df.machine_id.isin(uidict.keys()))&(df.deposited_at>uidict[df.machine_id])]

But this returns

TypeError: 'Series' objects are mutable, thus they cannot be hashed

So I thought I'd forget the dictionary and just use the groupby Series I had produced but..

 ---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-90-10d8db20a295> in <module>()
----> 1 df[(df.machine_name.isin(newuidict.index))&(df.deposited_at>newuidict[df.machine_name])]

~/anaconda3/lib/python3.6/site-packages/pandas/core/ops.py in wrapper(self, other, axis)
    816             if not self._indexed_same(other):
    817                 msg = 'Can only compare identically-labeled Series objects'
--> 818                 raise ValueError(msg)
    819             return self._constructor(na_op(self.values, other.values),
    820                                      index=self.index, name=name)

ValueError: Can only compare identically-labeled Series objects

Running this with a function and apply takes ages and I'll have to run this code fairly often, is there any way of getting this kind of filtering to work?

Small example of data:

 machine_id deposited_at
12  2018-10-04 14:49:38
56  2018-09-20 14:41:59
24  2018-08-25 14:50:07
56  2018-08-04 15:33:09
12  2018-08-01 18:18:44
24  2018-09-24 12:34:35
35  2018-10-01 17:09:38
21  2018-09-27 11:32:02
21  2018-09-27 11:33:55
23  2018-08-30 10:03:01

Upvotes: 0

Views: 422

Answers (2)

Rauf Bhat
Rauf Bhat

Reputation: 61

df['flag'] = df.apply(lambda x: True if x['deposited_at'].iloc[0] >= uidict(x['machine_id'].iloc[0]) else False)

df[df['flag'] == True]

Upvotes: 1

Wade Pimenta
Wade Pimenta

Reputation: 181

[Answer requires Python 3 and Pandas]

If it isn't too much trouble to alter your uidict, you could rather turn that into a dataframe and use a join. I'll illustrate the process below:

First, recreating your uidict:

import pandas as pd
from pandas import Timestamp

uidict= {
 14.0: Timestamp('2018-10-12 17:48:57'),
 16.0: Timestamp('2018-10-12 13:38:00'),
 19.0: Timestamp('2018-10-17 20:17:33'),
 20.0: Timestamp('2018-10-15 12:15:34'),
 27.0: Timestamp('2018-09-26 11:50:01'),
 29.0: Timestamp('2018-10-03 13:38:17'),
 31.0: Timestamp('2018-10-17 10:06:23'),
 33.0: Timestamp('2018-09-21 15:17:14'),
 34.0: Timestamp('2018-10-17 11:42:21'),
 42.0: Timestamp('2018-10-16 12:36:32'),
 45.0: Timestamp('2018-09-23 13:23:37'),
 53.0: Timestamp('2018-09-27 12:18:39'),
 60.0: Timestamp('2018-10-15 15:27:46'),
 62.0: Timestamp('2018-08-30 17:26:27'),
 63.0: Timestamp('2018-09-25 17:44:04'),
 64.0: Timestamp('2018-09-23 14:19:57'),
 65.0: Timestamp('2018-08-31 19:07:47'),
 66.0: Timestamp('2018-09-08 14:12:20'),
 67.0: Timestamp('2018-09-11 08:18:31'),
 69.0: Timestamp('2018-09-20 17:12:37'),
 70.0: Timestamp('2018-09-24 12:56:45'),
 71.0: Timestamp('2018-08-27 09:37:17'),
 72.0: Timestamp('2018-09-05 19:07:34'),
 73.0: Timestamp('2018-09-10 14:42:52'),
 74.0: Timestamp('2018-09-25 16:36:05'),
 75.0: Timestamp('2018-08-27 10:09:02'),
 76.0: Timestamp('2018-09-13 07:20:40'),
 77.0: Timestamp('2018-09-02 14:10:22'),
 78.0: Timestamp('2018-09-26 15:06:51'),
 79.0: Timestamp('2018-08-31 15:52:49'),
 81.0: Timestamp('2018-10-05 10:05:11')
}

We then can use this line to create a pandas dataframe, I've named the key of your dictionary as "machine_id" for convenience later.

uidf = pd.DataFrame(list(uidict.items()),columns=['machine_id','ui_date'])

Which results in:

    machine_id  ui_date 
0   64.0    2018-09-23 14:19:57
1   65.0    2018-08-31 19:07:47
2   66.0    2018-09-08 14:12:20
3   67.0    2018-09-11 08:18:31
4   69.0    2018-09-20 17:12:37
5   70.0    2018-09-24 12:56:45
6   71.0    2018-08-27 09:37:17
7   72.0    2018-09-05 19:07:34
8   73.0    2018-09-10 14:42:52
9   74.0    2018-09-25 16:36:05
10  75.0    2018-08-27 10:09:02
11  76.0    2018-09-13 07:20:40
12  77.0    2018-09-02 14:10:22
13  14.0    2018-10-12 17:48:57
14  79.0    2018-08-31 15:52:49
15  16.0    2018-10-12 13:38:00
16  81.0    2018-10-05 10:05:11
17  19.0    2018-10-17 20:17:33
18  20.0    2018-10-15 12:15:34
19  78.0    2018-09-26 15:06:51
20  27.0    2018-09-26 11:50:01
21  29.0    2018-10-03 13:38:17
22  31.0    2018-10-17 10:06:23
23  33.0    2018-09-21 15:17:14
24  34.0    2018-10-17 11:42:21
25  42.0    2018-10-16 12:36:32
26  45.0    2018-09-23 13:23:37
27  53.0    2018-09-27 12:18:39
28  60.0    2018-10-15 15:27:46
29  62.0    2018-08-30 17:26:27
30  63.0    2018-09-25 17:44:04

Then recreating your sample data, but I have added two test case rows at the bottom, as your provided sample doesn't appear to have any matches on uidict. Specifically, one row with machine_id = 81, but the date is earlier than that in uidict, and one where the date falls after.

data_sample = pd.DataFrame(
    [
        {'machine_id': 12, 'deposited_at' : Timestamp('2018-10-04 14:49:38')},
        {'machine_id': 56, 'deposited_at' : Timestamp('2018-09-20 14:41:59')},
        {'machine_id': 24, 'deposited_at' : Timestamp('2018-08-25 14:50:07')},
        {'machine_id': 56, 'deposited_at' : Timestamp('2018-08-04 15:33:09')},
        {'machine_id': 12, 'deposited_at' : Timestamp('2018-08-01 18:18:44')},
        {'machine_id': 24, 'deposited_at' : Timestamp('2018-09-24 12:34:35')},
        {'machine_id': 35, 'deposited_at' : Timestamp('2018-10-01 17:09:38')},
        {'machine_id': 21, 'deposited_at' : Timestamp('2018-09-27 11:32:02')},
        {'machine_id': 21, 'deposited_at' : Timestamp('2018-09-27 11:33:55')},
        {'machine_id': 23, 'deposited_at' : Timestamp('2018-08-30 10:03:01')},
        {'machine_id': 81, 'deposited_at' : Timestamp('2018-09-01 10:03:01')},
        {'machine_id': 81, 'deposited_at' : Timestamp('2018-10-06 10:03:01')}
    ]
)

    deposited_at    machine_id
0   2018-10-04 14:49:38 12
1   2018-09-20 14:41:59 56
2   2018-08-25 14:50:07 24
3   2018-08-04 15:33:09 56
4   2018-08-01 18:18:44 12
5   2018-09-24 12:34:35 24
6   2018-10-01 17:09:38 35
7   2018-09-27 11:32:02 21
8   2018-09-27 11:33:55 21
9   2018-08-30 10:03:01 23
10  2018-09-01 10:03:01 81
11  2018-10-06 10:03:01 81

We then do an inner join on these two DataFrames using the "machine_id" as a key, and then follow up with a simple filtering condition on the date. The last line here is to simply clean the columns to resemble your original input.

filtered_dataframe = data_sample.merge(uidf, on=['machine_id'], how='inner')

filtered_dataframe = filtered_dataframe[
    filtered_dataframe['deposited_at'] > filtered_dataframe['ui_date']
]

filtered_dataframe = filtered_dataframe[['machine_id', 'deposited_at']]

Which effectively ensures that 1) The machine id in the data sample is within your UI table and 2) That the deposited date is greater than the date in the UI table:

    machine_id  deposited_at
1   81  2018-10-06 10:03:01

Hope this is what you are looking for!

Upvotes: 1

Related Questions