Daniel
Daniel

Reputation: 35704

get a pivot from count using a 2d key

I've got time series data that I'd like to get a count of actions that happened per day/hour combination

I'm using Counter object to get the counts as I process items line by line

 c = Counter()
 for line in file
     c.update([[yyyymmdd, hh]])
     # or c[yyyymmdd,hh] += 1

how can I get a pivot of yyyymmdd as rows, hh as columns, and count as values?

Of course I could loop through the resulting count to generate the pivot, but I'm wondering if there is a function (or python trick) that can do this in a line or two

Upvotes: 1

Views: 42

Answers (1)

Anzel
Anzel

Reputation: 20563

Here is a naive and non-optimised approach for the pivot based on Counter:

In [1]: from io import StringIO  # Py2 from StringIO import StringIO
   ...: from datetime import datetime
   ...: from collections import Counter
   ...: import random
   ...:

In [2]: data = []  # generate some random data
   ...: for i in range(30):
   ...:     d = datetime(2016, 9, random.randrange(1, 7), random.randrange(1, 12), 0, 0)
   ...:     data.append(d.strftime("%Y%m%d,%H"))
   ...: s = StringIO("\n".join(data))
   ...:

In [3]: data[:5]
Out[3]: ['20160902,05', '20160901,05', '20160902,06', '20160902,05', '20160905,01']

In [4]: c = Counter()
   ...: for line in s.readlines():
   ...:     c[tuple(line.strip().split(","))] += 1  # use tuple as key, list is not hashable
   ...:

In [5]: c.most_common(5)
Out[5]:
[(('20160905', '04'), 4),
 (('20160902', '01'), 2),
 (('20160902', '05'), 2),
 (('20160904', '05'), 2),
 (('20160905', '01'), 2)]

In [6]:

In [6]: def print_pivot(c):
   ...:     labels = list(c.keys())
   ...:     # get unique yyyymmdd && hh values as index & columns
   ...:     index, columns = sorted({l[0] for l in labels}, key=int), sorted({l[1] for l in labels}, key=int)
   ...:     header = " "*8 + " | " + "  |  ".join(columns) + " |"
   ...:     print(header, "\n", "-"*len(header))
   ...:     # basically loop and get the (index, column) combination
   ...:     # from the Collection and print out value or blank
   ...:     for idx in index:
   ...:         print(idx + " |", "  |  ".join(str(c.get((idx, col), "")).ljust(2) for col in columns), "|")
   ...:

In [7]: print_pivot(c)
         | 01  |  02  |  03  |  04  |  05  |  06  |  09  |  10 |
 ----------------------------------------------------------------
20160901 |     |      |  1   |      |  1   |      |      |     |
20160902 | 2   |      |      |  1   |  2   |  1   |      |  1  |
20160903 |     |  1   |      |  1   |      |      |  1   |  1  |
20160904 | 1   |      |      |      |  2   |  1   |      |  1  |
20160905 | 2   |  1   |      |  4   |  1   |  1   |      |     |
20160906 |     |  1   |      |      |  1   |      |      |  1  |

And seriously, I mean seriously -- just use pandas for the sake of simplicity and ease of use:

In [12]: import pandas as pd
         # you may use pd.read_csv(..., **some options)
In [13]: df = pd.DataFrame(data=list(c.elements()), columns=["yyyymmdd", "hh"])

In [14]: df[:5]
Out[14]:
   yyyymmdd  hh
0  20160902  01
1  20160902  01
2  20160902  05
3  20160902  05
4  20160904  01

In [15]: df['value'] = df['hh']  # make a copy so we have columns + values

# df.groupby will group both columns and count the remaing "value"
# unstack the result will re-arrange "hh" as columns
# fillna -- fill the "na" fields as blank "" 
In [16]: df.groupby(['yyyymmdd', 'hh']).count().unstack().fillna('')
Out[17]:
         value
hh          01 02 03 04 05 06 09 10
yyyymmdd
20160901           1     1
20160902     2        1  2  1     1
20160903        1     1        1  1
20160904     1           2  1     1
20160905     2  1     4  1  1
20160906        1        1        1

Upvotes: 1

Related Questions