skanga

Reputation: 351

How to chart aggregated time series data with matplotlib

I have a CSV file of errors with the timestamp of when each error occurred. Sample data looks like this:

2020-01-06T02:54:01.012+0000, 500 Internal Server Error
2020-01-06T05:04:01.012+0000, 500 Internal Server Error
2020-01-06T05:44:01.012+0000, 500 Internal Server Error
2020-01-06T07:04:01.013+0000, 500 Internal Server Error
2020-01-06T08:04:01.012+0000, 500 Internal Server Error
2020-01-06T10:24:01.010+0000, 500 Internal Server Error
2020-01-06T17:48:31.192+0000, 503 Service Unavailable
2020-01-08T04:35:48.624+0000, 502 Bad Gateway
2020-01-08T16:56:04.814+0000, 503 Service Unavailable

I would like to use matplotlib to show the errors per minute (or per 30 seconds), with each error type as a separate line on the chart, but I do not know how to do the aggregation. This is what I tried. Any help?


import matplotlib.pyplot as plt
import pandas as pd

infile="example.csv"

# The timestamps are ISO 8601 with a +0000 offset, which read_csv can parse directly;
# skipinitialspace drops the blank that follows each comma.
df = pd.read_csv(infile, names=['datetime', 'error'], skipinitialspace=True,
                 parse_dates=['datetime'], index_col='datetime')

df.plot()
plt.show()

Upvotes: 0

Views: 355

Answers (1)

Valdi_Bo

Reputation: 30971

Start by defining the following function:

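import numpy as np
import pandas as pd

def counts(grp):
    # Count how many entries in the group match each error code.
    codeList = ['500', '502', '503']
    return pd.Series([np.count_nonzero(grp.eq(code)) for code in codeList],
        index=['Err_' + code for code in codeList])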

It will be applied below to each group produced by resampling your DataFrame.
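To see what the function returns for a single group, here is a minimal sketch. The `group` Series is a made-up stand-in for the codes that fall into one time bin; it is not taken from the log.

```python
import numpy as np
import pandas as pd

def counts(grp):
    # Count how many entries in the group match each error code.
    codeList = ['500', '502', '503']
    return pd.Series([np.count_nonzero(grp.eq(code)) for code in codeList],
                     index=['Err_' + code for code in codeList])

# Hypothetical contents of one time bin (illustration only).
group = pd.Series(['500', '503', '500', '502', '500'])
result = counts(group)
print(result)
```

The result is a Series with one entry per code, which is what lets the later unstack step turn the codes into columns.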

Read your file as follows:

df = pd.read_csv('your_log.csv', names=['Date', 'Message'],
    skipinitialspace=True, parse_dates=[0])

Note the skipinitialspace parameter, needed to strip the space after each comma (after the date/time).

Then run: df.Date = df.Date.dt.tz_localize(None) to drop the timezone part (+0000) from the result.
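The two reading steps can be tried without a file on disk. This is only a sketch: it feeds three rows from the question's sample through `io.StringIO` in place of `your_log.csv`, and the names `tz_before` and `tz_after` are introduced here just to inspect the result.

```python
import io
import pandas as pd

# Three rows copied from the question's sample data, used in place of a file.
sample = io.StringIO(
    "2020-01-06T02:54:01.012+0000, 500 Internal Server Error\n"
    "2020-01-06T17:48:31.192+0000, 503 Service Unavailable\n"
    "2020-01-08T04:35:48.624+0000, 502 Bad Gateway\n"
)

df = pd.read_csv(sample, names=['Date', 'Message'],
                 skipinitialspace=True, parse_dates=[0])

tz_before = df.Date.dt.tz                 # timezone-aware after parsing
df.Date = df.Date.dt.tz_localize(None)    # drop the +0000 offset
tz_after = df.Date.dt.tz                  # None once the offset is dropped

print(df)
```

Without skipinitialspace, each Message value would start with a space, so slicing its first three characters would no longer give the error code.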

Comparing full message strings is awkward, so I compare only the error codes (the first 3 characters of each message). To this end, let's create a new column with just the error codes:

df['Code'] = df.Message.str.slice(0,3)

The next step is to generate the number of errors in each hour, by resampling and applying the above function to each hourly group:

errCnt = df.resample('1H', on='Date').Code.apply(counts).unstack(level=1)

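If you want any other time resolution, change 1H to the desired period, for example 1min or 30s for the per-minute or per-30-second counts asked for in the question. (Recent pandas versions prefer the lowercase alias 1h for hours.)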
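For per-minute counts there is also a different route, shown here only as a sketch and not as the method used in this answer: group by a one-minute `pd.Grouper` together with the Code column, so the list of codes does not have to be hard-coded. The four input rows are hypothetical, written in the same format as the log so that two errors fall within one minute.

```python
import io
import pandas as pd

# Hypothetical rows in the log's format (illustration only).
sample = io.StringIO(
    "2020-01-06T05:04:01.012+0000, 500 Internal Server Error\n"
    "2020-01-06T05:04:31.500+0000, 500 Internal Server Error\n"
    "2020-01-06T05:04:45.000+0000, 503 Service Unavailable\n"
    "2020-01-06T05:06:10.250+0000, 502 Bad Gateway\n"
)
df = pd.read_csv(sample, names=['Date', 'Message'],
                 skipinitialspace=True, parse_dates=[0])
df.Date = df.Date.dt.tz_localize(None)
df['Code'] = df.Message.str.slice(0, 3)

per_minute = (
    df.groupby([pd.Grouper(key='Date', freq='1min'), 'Code'])
      .size()                       # rows per (minute, code) pair
      .unstack(fill_value=0)        # one column per error code
      .asfreq('1min', fill_value=0) # make sure minutes without errors appear as zeros
)
print(per_minute)
```

The columns here are named by the bare code ('500', '502', '503') instead of the Err_ prefix used above.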

And the last step is to add a Total column (if you need it):

errCnt['Total'] = errCnt.sum(axis=1)

For my sample data (with some more errors, in a shorter period), I got:

                     Err_500  Err_502  Err_503  Total
Date                                                       
2020-01-06 02:00:00        1        0        0      1
2020-01-06 03:00:00        0        0        0      0
2020-01-06 04:00:00        0        0        0      0
2020-01-06 05:00:00        1        2        2      5
2020-01-06 06:00:00        0        0        0      0
2020-01-06 07:00:00        1        0        0      1
2020-01-06 08:00:00        1        0        0      1
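To get from this table to the chart asked for in the question, one possible sketch is to draw each error column as its own line with matplotlib. To keep the example self-contained, errCnt is rebuilt by hand from the numbers in the table above; in practice you would use the errCnt produced by the resampling step. The Agg backend, figure size and axis labels are assumptions for illustration.

```python
import matplotlib
matplotlib.use('Agg')  # assumption: headless run; remove this line for an interactive window
import matplotlib.pyplot as plt
import pandas as pd

# Hourly counts copied from the table above.
errCnt = pd.DataFrame(
    {'Err_500': [1, 0, 0, 1, 0, 1, 1],
     'Err_502': [0, 0, 0, 2, 0, 0, 0],
     'Err_503': [0, 0, 0, 2, 0, 0, 0]},
    index=pd.date_range('2020-01-06 02:00', periods=7, freq='h', name='Date'),
)
errCnt['Total'] = errCnt.sum(axis=1)

fig, ax = plt.subplots(figsize=(10, 4))
for col in ['Err_500', 'Err_502', 'Err_503']:
    # One line per error code.
    ax.plot(errCnt.index, errCnt[col], marker='o', label=col)
ax.set_xlabel('Time')
ax.set_ylabel('Errors per hour')
ax.legend()
fig.autofmt_xdate()  # slant the date labels so they do not overlap
# plt.show()         # uncomment when using an interactive backend
```

The Total column is left out of the loop so that the chart shows one line per error, as requested; add it to the list if you want it drawn as well.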

Upvotes: 1
