Sfinos
Sfinos

Reputation: 379

Fill missing values in a list of dictionaries with months and years

I prepare a data for a visualization. The data structure is like this:

data = [{u'count': 1, u'_id': {u'year': 2010, u'month': 4}}, {u'count': 1, u'_id': {u'year': 2010, u'month': 5}}, {u'count': 2, u'_id': {u'year': 2010, u'month': 7}}, {u'count': 1, u'_id': {u'year': 2010, u'month': 9}}, {u'count': 1, u'_id': {u'year': 2010, u'month': 10}}, {u'count': 4, u'_id': {u'year': 2010, u'month': 12}}]

I convert them to a list with a timestamp and the count variable with this way:

chart = []
for month in data:
      d = datetime.datetime.strptime(str(month['_id']['year'])+"-"+str(month['_id']['month']),'%Y-%m')
      dat = time.mktime(d.timetuple())
      chart.append([dat*1000,month['count']])

And the results are somehow like this (The example doesn't fit with the example of the input-data)

chart: [[1220216400000.0, 1], [1222808400000.0, 8], [1225490400000.0, 1], [1228082400000.0, 6], [1230760800000.0, 4], [1233439200000.0, 1], [1235858400000.0, 1], [1238533200000.0, 1], [1241125200000.0, 2], [1243803600000.0, 1], [1246395600000.0, 1], [1249074000000.0, 1]]

What I am trying to do is to change the code to include also the missing months between the first and the last date with count=0. For example, in the data, from year 2010 month 5 the next field is year 10 month 7. The month 6 is missing and I want to include it with count=0.

Any idea?

Upvotes: 1

Views: 956

Answers (3)

alecxe
alecxe

Reputation: 474081

Here's a solution using dateutil library for iterating over date ranges monthly.

The idea is to initialize an OrderedDict with datetime as a key and count as a value. Then for every item in the ordered dictionary, iterate monthly over the date range between current and previously added item and add 0 count:

from collections import OrderedDict
import datetime
from pprint import pprint
import time
from dateutil.rrule import rrule, MONTHLY


data = [{u'count': 1, u'_id': {u'year': 2010, u'month': 4}}, {u'count': 1, u'_id': {u'year': 2010, u'month': 5}},
        {u'count': 2, u'_id': {u'year': 2010, u'month': 7}}, {u'count': 1, u'_id': {u'year': 2010, u'month': 9}},
        {u'count': 1, u'_id': {u'year': 2010, u'month': 10}}, {u'count': 4, u'_id': {u'year': 2010, u'month': 12}}]

new_data = OrderedDict()
for item in data:
    year, month = item['_id']['year'], item['_id']['month']
    d = datetime.datetime.strptime(str(year) + "-" + str(month), '%Y-%m')
    new_data[d] = item['count']

chart = {}
last_added = None
for d, count in new_data.iteritems():
    date_start = last_added if last_added else d
    for dt in rrule(MONTHLY, dtstart=date_start, until=d):
        key = time.mktime(dt.timetuple()) * 1000
        if key not in chart:
            chart[key] = count if dt == d else 0
    last_added = d

pprint(sorted(chart.items()))

Prints:

[(1270094400000.0, 1),
 (1272686400000.0, 1),
 (1275364800000.0, 0),
 (1277956800000.0, 2),
 (1280635200000.0, 0),
 (1283313600000.0, 1),
 (1285905600000.0, 1),
 (1288584000000.0, 0),
 (1291179600000.0, 4)]

Hope it works for you.

Upvotes: 1

alecxe
alecxe

Reputation: 474081

Here's one way to do it.

The idea is to have a dictionary dat -> count. If you don't know what years are going to be in the data, you need to initialize monthly data on every iteration:

import datetime
from pprint import pprint
import time

data = [{u'count': 1, u'_id': {u'year': 2010, u'month': 4}}, {u'count': 1, u'_id': {u'year': 2010, u'month': 5}},
        {u'count': 2, u'_id': {u'year': 2010, u'month': 7}}, {u'count': 1, u'_id': {u'year': 2010, u'month': 9}},
        {u'count': 1, u'_id': {u'year': 2010, u'month': 10}}, {u'count': 4, u'_id': {u'year': 2010, u'month': 12}}]

chart = {}
for month in data:
    year = month['_id']['year']
    for m in xrange(1, 12):
        d = datetime.datetime.strptime(str(year) + "-" + str(m), '%Y-%m')
        dat = time.mktime(d.timetuple()) * 1000
        if dat not in chart:
            chart[dat] = 0

    d = datetime.datetime.strptime(str(year) + "-" + str(month['_id']['month']), '%Y-%m')
    dat = time.mktime(d.timetuple()) * 1000
    chart[dat] = month['count']

pprint(sorted(chart.items()))

If you know what years are in the data - initialize month counts before the loop over the data.

Prints:

[(1262322000000.0, 0),
 (1265000400000.0, 0),
 (1267419600000.0, 0),
 (1270094400000.0, 1),
 (1272686400000.0, 1),
 (1275364800000.0, 0),
 (1277956800000.0, 2),
 (1280635200000.0, 0),
 (1283313600000.0, 1),
 (1285905600000.0, 1),
 (1288584000000.0, 0),
 (1291179600000.0, 4)]

See - missing months are there with 0 count.

Hope that helps.

Upvotes: 1

Catalin Luta
Catalin Luta

Reputation: 781

I see that your list is sorted, so you would only have to remember the previous date (initially set to 1) and fill the list with the missing elements if there are any (i.e if the difference between month['_id']['month'] and previous date is greater than 1).

Upvotes: 0

Related Questions