Reputation: 8557
I have a dataset taken from the Windows Eventlog. The TimeGenerated
column is set as the index. I'd like to get an aggregated view showing me the number of events, by EventType
(info/warn/err) and by the index value. I could use resample()
to set the datetime resolution (day, business day, etc).
Here's my DataFrame:
log.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 80372 entries, 2015-08-31 12:15:23 to 2015-05-11 04:08:07
Data columns (total 4 columns):
EventID 80372 non-null int64
SourceName 80372 non-null object
EventType 76878 non-null object
EventCategory 80372 non-null int64
dtypes: int64(2), object(2)
memory usage: 3.1+ MB
I can of course group by the EventType, but this drops my index:
log[['EventID', 'EventType']].groupby('EventType').count('EventID')
I would have to specify my existing index it in the call to groupby()
, but how can I reference the index? Or do I have to perform a reset_index()
before the groupby()
call? Or am I simply going about this all wrong and is it painfully obvious that I'm a Pandas newbie? ;-)
Version info:
To clarify further, what I'd like to achieve is:
Note that the Timestamp is not unique (in the raw DF), as multiple events can occur simultaneously.
One way I've been able to achieve what I wanted, is by doing:
temp = log.reset_index()
temp.groupby(['TimeGenerated','EventType']).count('EventID'['EventID'].unstack().fillna(0)
In that case, my output is:
Which then allows me to resample the count further, e.g. :
temp.resample('MS', how='sum')
This works, but what I don't know if whether having to perform a reset_index()
is necessary to achieve this grouping. Could I have done it in a better (read: more efficient) way?
Upvotes: 2
Views: 7935
Reputation: 8557
What I was missing is that you can perform a groupby()
on one or more levels of your index.
test = log.set_index('EventType', append=True)
test = test.groupby(level=[0,1])['EventID'].count('EventID')
test.unstack().fillna(0)
Alternatively, the suggestion by Brian Pendleton worked as well:
pd.get_dummies(log.EventType)
The difference with this last approach is that it doesn't work as well if you need to add another level in your column axis (e.g. by Hostname). But that wasn't part of the original question of course.
Upvotes: 4