I have a pandas data frame as follows:
date | Item | count
------------------------------------
2016-12-06 10:45:08 | Item1 | 60
2016-12-06 10:45:08 | Item2 | 145
2016-12-06 09:45:00 | Item1 | 60
2016-12-06 09:44:54 | Item3 | 600
2016-12-06 09:44:48 | Item4 | 15
2016-12-06 11:45:08 | Item1 | 60
2016-12-06 10:45:08 | Item2 | 14
2016-11-06 09:45:00 | Item1 | 62
2016-11-06 09:44:54 | Item3 | 6
2016-11-06 09:44:48 | Item4 | 15
I am trying to group the items by, say, hour of the day (or later just by day) to get the following statistics: the list of items sold per period. For example:
On 2016-12-06, from 09:00:00 to 10:00:00, Item1, Item3 and Item4 were sold; and so on.
On 2016-12-06, Item1, Item2, Item3 and Item4 (unique items) were sold.
While I am far away from fetching these statistics, I am stuck with grouping by time.
Initially, print df.dtypes showed:
date object
Item object
count int64
dtype: object
So, I used the following line of code to convert the date column to a pandas date object.
df['date'] = pd.to_datetime(df['date'])
and now print df.dtypes yields:
date datetime64[ns]
Item object
count int64
dtype: object
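For anyone who wants to reproduce this, here is a minimal sketch that rebuilds the frame from the rows in the table above. The variable name rows is only illustrative, and the exact dtype strings printed may differ between pandas versions. It also prints the index type, which is what the error further down complains about.

```python
import pandas as pd

# Sample rows copied from the table in the question.
rows = [
    ("2016-12-06 10:45:08", "Item1", 60),
    ("2016-12-06 10:45:08", "Item2", 145),
    ("2016-12-06 09:45:00", "Item1", 60),
    ("2016-12-06 09:44:54", "Item3", 600),
    ("2016-12-06 09:44:48", "Item4", 15),
    ("2016-12-06 11:45:08", "Item1", 60),
    ("2016-12-06 10:45:08", "Item2", 14),
    ("2016-11-06 09:45:00", "Item1", 62),
    ("2016-11-06 09:44:54", "Item3", 6),
    ("2016-11-06 09:44:48", "Item4", 15),
]
df = pd.DataFrame(rows, columns=["date", "Item", "count"])

# Convert the string column to datetimes, as done in the question.
df["date"] = pd.to_datetime(df["date"])

print(df.dtypes)
# The conversion changes the column only; the index is still a plain RangeIndex.
print(type(df.index).__name__)
```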
However, when I try to group by the date column using TimeGrouper by executing the following lines of code:
from pandas.tseries.resample import TimeGrouper
print df.groupby([df['date'],pd.TimeGrouper(freq='Min')])
I get the following TypeError, even though, as per the suggestions given here or here, converting with pd.to_datetime should have resolved this issue:
TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'RangeIndex'
I have no clue how to solve this issue to proceed to the statistics I am looking for. Any tips on resolving this error and using TimeGrouper to seek the statistics preferably in a dictionary format (or anything that makes more sense) would be much appreciated.
Upvotes: 5
You can use groupby with a numpy array of the datetimes with minutes and seconds removed:
print (df['date'].values.astype('<M8[h]'))
['2016-12-06T10' '2016-12-06T10' '2016-12-06T09' '2016-12-06T09'
'2016-12-06T09' '2016-12-06T11' '2016-12-06T10' '2016-11-06T09'
'2016-11-06T09' '2016-11-06T09']
print (df.groupby(df['date'].values.astype('<M8[h]')).Item.unique())
2016-11-06 09:00:00 [Item1, Item3, Item4]
2016-12-06 09:00:00 [Item1, Item3, Item4]
2016-12-06 10:00:00 [Item1, Item2]
2016-12-06 11:00:00 [Item1]
Name: Item, dtype: object
print (df.groupby(df['date'].values.astype('<M8[h]')).Item
.apply(lambda x: x.unique().tolist()).to_dict())
{Timestamp('2016-11-06 09:00:00'): ['Item1', 'Item3', 'Item4'],
Timestamp('2016-12-06 09:00:00'): ['Item1', 'Item3', 'Item4'],
Timestamp('2016-12-06 10:00:00'): ['Item1', 'Item2'],
Timestamp('2016-12-06 11:00:00'): ['Item1']}
print (df.groupby(df['date'].values.astype('<M8[D]')).Item
.apply(lambda x: x.unique().tolist()).to_dict())
{Timestamp('2016-11-06 00:00:00'): ['Item1', 'Item3', 'Item4'],
Timestamp('2016-12-06 00:00:00'): ['Item1', 'Item2', 'Item3', 'Item4']}
Thanks to Jeff for suggesting round (note that it rounds to the nearest hour rather than truncating, so 09:45 becomes 10:00):
print (df.groupby(df['date'].dt.round('h')).Item
.apply(lambda x: x.unique().tolist()).to_dict())
{Timestamp('2016-11-06 10:00:00'): ['Item1', 'Item3', 'Item4'],
Timestamp('2016-12-06 12:00:00'): ['Item1'],
Timestamp('2016-12-06 10:00:00'): ['Item1', 'Item3', 'Item4'],
Timestamp('2016-12-06 11:00:00'): ['Item1', 'Item2']}
print (df.groupby(df['date'].dt.round('d')).Item
.apply(lambda x: x.unique().tolist()).to_dict())
{Timestamp('2016-11-06 00:00:00'): ['Item1', 'Item3', 'Item4'],
Timestamp('2016-12-06 00:00:00'): ['Item1', 'Item2', 'Item3', 'Item4']}
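The hourly keys from round differ from the truncated ones in the numpy version because round goes to the nearest boundary. If truncation is what you want, dt.floor is probably the closer match. A small sketch on three of the timestamps from the question (the names stamps, rounded and floored are only illustrative):

```python
import pandas as pd

# Three timestamps taken from the question's data.
stamps = pd.Series(pd.to_datetime([
    "2016-12-06 09:44:48",
    "2016-12-06 09:45:00",
    "2016-12-06 10:45:08",
]))

rounded = stamps.dt.round("h")  # nearest hour: 09:44:48 goes up to 10:00
floored = stamps.dt.floor("h")  # start of the containing hour: 09:44:48 goes to 09:00

# Show the three variants side by side for comparison.
print(pd.DataFrame({"original": stamps, "round": rounded, "floor": floored}))
```

Using df['date'].dt.floor('h') as the groupby key should then give the same buckets as the astype('<M8[h]') version.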
Upvotes: 3
sold = df.set_index('date').Item.resample('H').agg({'Sold': 'unique'})
sold[sold.Sold.str.len() > 0]
Sold
date
2016-11-06 09:00:00 [Item4, Item3, Item1]
2016-12-06 09:00:00 [Item4, Item3, Item1]
2016-12-06 10:00:00 [Item1, Item2]
2016-12-06 11:00:00 [Item1]
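As for the TypeError in the question: a time grouper without a key bins on the index, which is why a RangeIndex is rejected even after the column is converted. Setting the index, as above, is one way around it. Another is to pass the column name via key. The sketch below uses pd.Grouper, since pd.TimeGrouper is no longer available in newer pandas releases. The name items_by_hour is only illustrative, and empty hourly bins are filtered out by hand because how they are reported may vary between versions.

```python
import pandas as pd

# Date and Item columns from the question's table.
df = pd.DataFrame({
    "date": pd.to_datetime([
        "2016-12-06 10:45:08", "2016-12-06 10:45:08", "2016-12-06 09:45:00",
        "2016-12-06 09:44:54", "2016-12-06 09:44:48", "2016-12-06 11:45:08",
        "2016-12-06 10:45:08", "2016-11-06 09:45:00", "2016-11-06 09:44:54",
        "2016-11-06 09:44:48",
    ]),
    "Item": ["Item1", "Item2", "Item1", "Item3", "Item4",
             "Item1", "Item2", "Item1", "Item3", "Item4"],
})

# key="date" tells the grouper to bin on that column instead of the index.
grouped = df.groupby(pd.Grouper(key="date", freq="h"))["Item"]

# Build {hour start: sorted unique items}, skipping hours with no sales.
items_by_hour = {
    stamp: sorted(items.unique())
    for stamp, items in grouped
    if len(items) > 0
}

for stamp, items in items_by_hour.items():
    print(stamp, items)
```

Changing freq to "D" would give the per-day version of the same dictionary.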
Upvotes: 3