Reputation: 402
I work in a kiosk company, we're looking to find if a UI update has made any difference. Each machine has had an update at different dates/times. I have built a dictionary of machine_ids and the timestamps at which the new UI was installed. I then want to use this to filter results, so only return rows where the machine_id is in the dictionary and deposit dates are greater than the corresponding date in the dict
uidict= {
14.0: Timestamp('2018-10-12 17:48:57'),
16.0: Timestamp('2018-10-12 13:38:00'),
19.0: Timestamp('2018-10-17 20:17:33'),
20.0: Timestamp('2018-10-15 12:15:34'),
27.0: Timestamp('2018-09-26 11:50:01'),
29.0: Timestamp('2018-10-03 13:38:17'),
31.0: Timestamp('2018-10-17 10:06:23'),
33.0: Timestamp('2018-09-21 15:17:14'),
34.0: Timestamp('2018-10-17 11:42:21'),
42.0: Timestamp('2018-10-16 12:36:32'),
45.0: Timestamp('2018-09-23 13:23:37'),
53.0: Timestamp('2018-09-27 12:18:39'),
60.0: Timestamp('2018-10-15 15:27:46'),
62.0: Timestamp('2018-08-30 17:26:27'),
63.0: Timestamp('2018-09-25 17:44:04'),
64.0: Timestamp('2018-09-23 14:19:57'),
65.0: Timestamp('2018-08-31 19:07:47'),
66.0: Timestamp('2018-09-08 14:12:20'),
67.0: Timestamp('2018-09-11 08:18:31'),
69.0: Timestamp('2018-09-20 17:12:37'),
70.0: Timestamp('2018-09-24 12:56:45'),
71.0: Timestamp('2018-08-27 09:37:17'),
72.0: Timestamp('2018-09-05 19:07:34'),
73.0: Timestamp('2018-09-10 14:42:52'),
74.0: Timestamp('2018-09-25 16:36:05'),
75.0: Timestamp('2018-08-27 10:09:02'),
76.0: Timestamp('2018-09-13 07:20:40'),
77.0: Timestamp('2018-09-02 14:10:22'),
78.0: Timestamp('2018-09-26 15:06:51'),
79.0: Timestamp('2018-08-31 15:52:49'),
81.0: Timestamp('2018-10-05 10:05:11')}
I tried this filtering to make it work:
df[(df.machine_id.isin(uidict.keys()))&(df.deposited_at>uidict[df.machine_id])]
But this returns
TypeError: 'Series' objects are mutable, thus they cannot be hashed
So I thought I'd forget the dictionary and just use the groupby Series I had produced but..
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-90-10d8db20a295> in <module>()
----> 1 df[(df.machine_name.isin(newuidict.index))&(df.deposited_at>newuidict[df.machine_name])]
~/anaconda3/lib/python3.6/site-packages/pandas/core/ops.py in wrapper(self, other, axis)
816 if not self._indexed_same(other):
817 msg = 'Can only compare identically-labeled Series objects'
--> 818 raise ValueError(msg)
819 return self._constructor(na_op(self.values, other.values),
820 index=self.index, name=name)
ValueError: Can only compare identically-labeled Series objects
Running this with a function and apply takes ages and I'll have to run this code fairly often, is there any way of getting this kind of filtering to work?
Small example of data:
machine_id deposited_at
12 2018-10-04 14:49:38
56 2018-09-20 14:41:59
24 2018-08-25 14:50:07
56 2018-08-04 15:33:09
12 2018-08-01 18:18:44
24 2018-09-24 12:34:35
35 2018-10-01 17:09:38
21 2018-09-27 11:32:02
21 2018-09-27 11:33:55
23 2018-08-30 10:03:01
Upvotes: 0
Views: 422
Reputation: 61
df['flag'] = df.apply(lambda x: True if x['deposited_at'].iloc[0] >= uidict(x['machine_id'].iloc[0]) else False)
df[df['flag'] == True]
Upvotes: 1
Reputation: 181
[Answer requires Python 3 and Pandas]
If it isn't too much trouble to alter your uidict, you could rather turn that into a dataframe and use a join. I'll illustrate the process below:
First, recreating your uidict:
import pandas as pd
from pandas import Timestamp
uidict= {
14.0: Timestamp('2018-10-12 17:48:57'),
16.0: Timestamp('2018-10-12 13:38:00'),
19.0: Timestamp('2018-10-17 20:17:33'),
20.0: Timestamp('2018-10-15 12:15:34'),
27.0: Timestamp('2018-09-26 11:50:01'),
29.0: Timestamp('2018-10-03 13:38:17'),
31.0: Timestamp('2018-10-17 10:06:23'),
33.0: Timestamp('2018-09-21 15:17:14'),
34.0: Timestamp('2018-10-17 11:42:21'),
42.0: Timestamp('2018-10-16 12:36:32'),
45.0: Timestamp('2018-09-23 13:23:37'),
53.0: Timestamp('2018-09-27 12:18:39'),
60.0: Timestamp('2018-10-15 15:27:46'),
62.0: Timestamp('2018-08-30 17:26:27'),
63.0: Timestamp('2018-09-25 17:44:04'),
64.0: Timestamp('2018-09-23 14:19:57'),
65.0: Timestamp('2018-08-31 19:07:47'),
66.0: Timestamp('2018-09-08 14:12:20'),
67.0: Timestamp('2018-09-11 08:18:31'),
69.0: Timestamp('2018-09-20 17:12:37'),
70.0: Timestamp('2018-09-24 12:56:45'),
71.0: Timestamp('2018-08-27 09:37:17'),
72.0: Timestamp('2018-09-05 19:07:34'),
73.0: Timestamp('2018-09-10 14:42:52'),
74.0: Timestamp('2018-09-25 16:36:05'),
75.0: Timestamp('2018-08-27 10:09:02'),
76.0: Timestamp('2018-09-13 07:20:40'),
77.0: Timestamp('2018-09-02 14:10:22'),
78.0: Timestamp('2018-09-26 15:06:51'),
79.0: Timestamp('2018-08-31 15:52:49'),
81.0: Timestamp('2018-10-05 10:05:11')
}
We then can use this line to create a pandas dataframe, I've named the key of your dictionary as "machine_id" for convenience later.
uidf = pd.DataFrame(list(uidict.items()),columns=['machine_id','ui_date'])
Which results in:
machine_id ui_date
0 64.0 2018-09-23 14:19:57
1 65.0 2018-08-31 19:07:47
2 66.0 2018-09-08 14:12:20
3 67.0 2018-09-11 08:18:31
4 69.0 2018-09-20 17:12:37
5 70.0 2018-09-24 12:56:45
6 71.0 2018-08-27 09:37:17
7 72.0 2018-09-05 19:07:34
8 73.0 2018-09-10 14:42:52
9 74.0 2018-09-25 16:36:05
10 75.0 2018-08-27 10:09:02
11 76.0 2018-09-13 07:20:40
12 77.0 2018-09-02 14:10:22
13 14.0 2018-10-12 17:48:57
14 79.0 2018-08-31 15:52:49
15 16.0 2018-10-12 13:38:00
16 81.0 2018-10-05 10:05:11
17 19.0 2018-10-17 20:17:33
18 20.0 2018-10-15 12:15:34
19 78.0 2018-09-26 15:06:51
20 27.0 2018-09-26 11:50:01
21 29.0 2018-10-03 13:38:17
22 31.0 2018-10-17 10:06:23
23 33.0 2018-09-21 15:17:14
24 34.0 2018-10-17 11:42:21
25 42.0 2018-10-16 12:36:32
26 45.0 2018-09-23 13:23:37
27 53.0 2018-09-27 12:18:39
28 60.0 2018-10-15 15:27:46
29 62.0 2018-08-30 17:26:27
30 63.0 2018-09-25 17:44:04
Then recreating your sample data, but I have added two test case rows at the bottom, as your provided sample doesn't appear to have any matches on uidict. Specifically, one row with machine_id = 81, but the date is earlier than that in uidict, and one where the date falls after.
data_sample = pd.DataFrame(
[
{'machine_id': 12, 'deposited_at' : Timestamp('2018-10-04 14:49:38')},
{'machine_id': 56, 'deposited_at' : Timestamp('2018-09-20 14:41:59')},
{'machine_id': 24, 'deposited_at' : Timestamp('2018-08-25 14:50:07')},
{'machine_id': 56, 'deposited_at' : Timestamp('2018-08-04 15:33:09')},
{'machine_id': 12, 'deposited_at' : Timestamp('2018-08-01 18:18:44')},
{'machine_id': 24, 'deposited_at' : Timestamp('2018-09-24 12:34:35')},
{'machine_id': 35, 'deposited_at' : Timestamp('2018-10-01 17:09:38')},
{'machine_id': 21, 'deposited_at' : Timestamp('2018-09-27 11:32:02')},
{'machine_id': 21, 'deposited_at' : Timestamp('2018-09-27 11:33:55')},
{'machine_id': 23, 'deposited_at' : Timestamp('2018-08-30 10:03:01')},
{'machine_id': 81, 'deposited_at' : Timestamp('2018-09-01 10:03:01')},
{'machine_id': 81, 'deposited_at' : Timestamp('2018-10-06 10:03:01')}
]
)
deposited_at machine_id
0 2018-10-04 14:49:38 12
1 2018-09-20 14:41:59 56
2 2018-08-25 14:50:07 24
3 2018-08-04 15:33:09 56
4 2018-08-01 18:18:44 12
5 2018-09-24 12:34:35 24
6 2018-10-01 17:09:38 35
7 2018-09-27 11:32:02 21
8 2018-09-27 11:33:55 21
9 2018-08-30 10:03:01 23
10 2018-09-01 10:03:01 81
11 2018-10-06 10:03:01 81
We then do an inner join on these two DataFrames using the "machine_id" as a key, and then follow up with a simple filtering condition on the date. The last line here is to simply clean the columns to resemble your original input.
filtered_dataframe = data_sample.merge(uidf, on=['machine_id'], how='inner')
filtered_dataframe = filtered_dataframe[
filtered_dataframe['deposited_at'] > filtered_dataframe['ui_date']
]
filtered_dataframe = filtered_dataframe[['machine_id', 'deposited_at']]
Which effectively ensures that 1) The machine id in the data sample is within your UI table and 2) That the deposited date is greater than the date in the UI table:
machine_id deposited_at
1 81 2018-10-06 10:03:01
Hope this is what you are looking for!
Upvotes: 1