Reputation: 35704
I've got time series data for which I'd like to get a count of actions per day/hour combination. I'm using a Counter object to get the counts as I process the file line by line:
c = Counter()
for line in file:
    c.update([(yyyymmdd, hh)])  # yyyymmdd, hh parsed from the line
    # or: c[yyyymmdd, hh] += 1
How can I get a pivot with yyyymmdd as rows, hh as columns, and the counts as values?
Of course I could loop through the resulting Counter to generate the pivot, but I'm wondering whether there is a function (or Python trick) that can do this in a line or two.
Upvotes: 1
Views: 42
Reputation: 20563
Here is a naive and non-optimised approach for the pivot based on Counter:
In [1]: from io import StringIO # Py2 from StringIO import StringIO
...: from datetime import datetime
...: from collections import Counter
...: import random
...:
In [2]: data = [] # generate some random data
...: for i in range(30):
...: d = datetime(2016, 9, random.randrange(1, 7), random.randrange(1, 12), 0, 0)
...: data.append(d.strftime("%Y%m%d,%H"))
...: s = StringIO("\n".join(data))
...:
In [3]: data[:5]
Out[3]: ['20160902,05', '20160901,05', '20160902,06', '20160902,05', '20160905,01']
In [4]: c = Counter()
...: for line in s.readlines():
...: c[tuple(line.strip().split(","))] += 1 # use tuple as key, list is not hashable
...:
In [5]: c.most_common(5)
Out[5]:
[(('20160905', '04'), 4),
(('20160902', '01'), 2),
(('20160902', '05'), 2),
(('20160904', '05'), 2),
(('20160905', '01'), 2)]
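As an aside, since the question asked for something short: the counting loop above can also be collapsed into a single Counter(...) call by feeding it a generator of key tuples. A minimal sketch, reusing the StringIO handle s from above (it has to be rewound first because readlines() already consumed it):
s.seek(0)                                                   # rewind the in-memory file
c2 = Counter(tuple(line.strip().split(",")) for line in s)  # count (yyyymmdd, hh) tuples directly
assert c2 == c                                              # same counts as the explicit loop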
In [6]: def print_pivot(c):
...: labels = list(c.keys())
...: # get unique yyyymmdd && hh values as index & columns
...: index, columns = sorted({l[0] for l in labels}, key=int), sorted({l[1] for l in labels}, key=int)
...: header = " "*8 + " | " + " | ".join(columns) + " |"
...: print(header, "\n", "-"*len(header))
...: # basically loop and get the (index, column) combination
...: # from the Collection and print out value or blank
...: for idx in index:
...: print(idx + " |", " | ".join(str(c.get((idx, col), "")).ljust(2) for col in columns), "|")
...:
In [7]: print_pivot(c)
         | 01 | 02 | 03 | 04 | 05 | 06 | 09 | 10 |
 --------------------------------------------------
20160901 |    |    | 1  |    | 1  |    |    |    |
20160902 | 2  |    |    | 1  | 2  | 1  |    | 1  |
20160903 |    | 1  |    | 1  |    |    | 1  | 1  |
20160904 | 1  |    |    |    | 2  | 1  |    | 1  |
20160905 | 2  | 1  |    | 4  | 1  | 1  |    |    |
20160906 |    | 1  |    |    | 1  |    |    | 1  |
And seriously, I mean seriously -- just use pandas for the sake of simplicity and ease of use:
In [12]: import pandas as pd
# you may use pd.read_csv(..., **some options)
In [13]: df = pd.DataFrame(data=list(c.elements()), columns=["yyyymmdd", "hh"])
In [14]: df[:5]
Out[14]:
yyyymmdd hh
0 20160902 01
1 20160902 01
2 20160902 05
3 20160902 05
4 20160904 01
In [15]: df['value'] = df['hh'] # make a copy so we have columns + values
# df.groupby will group by both columns and count the remaining "value" column
# unstacking the result re-arranges "hh" as columns
# fillna('') fills the NaN cells with blanks
In [16]: df.groupby(['yyyymmdd', 'hh']).count().unstack().fillna('')
Out[16]:
         value
hh          01 02 03 04 05 06 09 10
yyyymmdd
20160901           1     1
20160902     2        1  2  1     1
20160903        1     1        1  1
20160904     1           2  1     1
20160905     2  1     4  1  1
20160906        1        1        1
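If all you need is exactly this table of counts, pd.crosstab is arguably an even shorter route: it tabulates frequencies directly, so the helper "value" column isn't needed. A minimal sketch, assuming the same df as above (it shows 0 rather than blanks where there is no data):
pd.crosstab(df['yyyymmdd'], df['hh'])   # one line: counts per (yyyymmdd, hh) pair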
Upvotes: 1