Strings
Strings

Reputation: 133

How can I efficiently merge these two datasets?

So I have two lists of data, looking like this (shortened):

[[1.0, 1403603100],
 [0.0, 1403603400],
 [2.0, 1403603700],
 [0.0, 1403604000],
 [None, 1403604300]]

[1.0, 1403603100],
[0.0, 1403603400],
[1.0, 1403603700],
[None, 1403604000],
[5.0, 1403604300]]

What I'm wanting to do is merge them, summing the first elements of each dataset, or making it 0.0 if either counter value is None. So the above example would become this:

[[2.0, 1403603100],
[0.0, 1403603400],
[3.0, 1403603700],
[0.0, 1403604000],
[0.0, 1403604300]]

This is what I've come up with so far, apologies if it's a bit cludgy.

def emit_datum(datapoints):
    for datum in datapoints:
        yield datum

def merge_data(data_set1, data_set2):

    assert len(data_set1) == len(data_set2)
    data_length = len(data_set1)

    data_gen1 = emit_datum(data_set1)
    data_gen2 = emit_datum(data_set2)

    merged_data = []

    for _ in range(data_length):

        datum1 = data_gen1.next()
        datum2 = data_gen2.next()

        if datum1[0] is None or datum2[0] is None:
            merged_data.append([0.0, datum1[1]])
            continue

        count = datum1[0] + datum2[0]
        merged_data.append([count, datum1[1]])

    return merged_data

I can only hope/assume that there's something cunning I can do with itertools or collections?

Upvotes: 3

Views: 277

Answers (4)

Padraic Cunningham
Padraic Cunningham

Reputation: 180482

If you are making both values equal to 0.0 if either are None you just need a simple loop.

 l1 = [1.0, 1403603100],
 [0.0, 1403603400],
 [2.0, 1403603700],
 [0.0, 1403604000],
 [None, 1403604300]]

l2 = [[1.0, 1403603100],
[0.0, 1403603400],
[1.0, 1403603700],
[None, 1403604000],
[5.0, 1403604300]]

final = []
assert len(l1)== len(l2)
for x, y in zip(l1, l2):
    if x[0] is  None or y[0] is  None:
        y[0] = 0.0
        final.append(y)
    else:
        final.append([x[0] + y[0], x[-1]])
print final

[[2.0, 1403603100], [0.0, 1403603400], [3.0, 1403603700], [0.0, 1403604000], [0.0, 1403604300]]


In [51]: %timeit merge_data(l1,l2)
100000 loops, best of 3: 5.76 µs per loop


 In [52]: %%timeit                 
   ....: final = []
   ....: assert len(l1)==len(l2)
   ....: for x, y in zip(l1, l2):
   ....:     if x[0] is  None or y[0] is None:
   ....:         y[0] = 0.0
   ....:         final.append(y)
   ....:     else:
   ....:         final.append([x[0] + y[0], x[-1]])
   ....: 
100000 loops, best of 3: 2.64 µs per loop

Upvotes: 1

Ian Knight
Ian Knight

Reputation: 2476

You can make use of zip, like so:

def merge(list1, list2):
    returnlist = []
    for x, y in zip(list1, list2):
        if x[0] is None or y[0] is None:
            returnlist.append([0.0, x[1]])
        else:
            returnlist.append([x[0] + y[0], x[1]])

    return returnlist

zip returns an iterator over tuples containing elements from each input list with the same index (i.e. (list1[0], list2[0]), (list1[1], list2[1]), etc.)

Upvotes: 0

Dr. Jan-Philip Gehrcke
Dr. Jan-Philip Gehrcke

Reputation: 35796

What about 'binning' the data based on the identifier, i.e. collecting all values corresponding to one identifier (e.g. 1403603400), and later sum it up. A dictionary is great for collecting all values corresponding to an identifier (key), and a defaultdict of type list makes this especially simple:

>>> data = [[1.0, 1403603100],  [1.0, 1403603100],
...  [0.0, 1403603400],  [0.0, 1403603400],
...  [2.0, 1403603700],  [1.0, 1403603700],
...  [0.0, 1403604000],  [None, 1403604000],
...  [None, 1403604300],  [5.0, 1403604300]]

>>> from collections import defaultdict
>>> d = defaultdict(list)
>>> for value, identifier in data:
...     d[identifier].append(value)
... 

Now we have the data sorted and can conditionally sum it:

>>> for identifier, valuelist in d.iteritems():
...     if not None in valuelist:
...         print identifier, sum(valuelist)
...     else:
...         print identifier, 0.0
... 
1403603400 0.0
1403603700 3.0
1403603100 2.0
1403604300 0.0
1403604000 0.0

The last part in short, to obtain the list that you wanted to:

>>> [[i, sum(v)] if None not in v else [i, .0] for i, v in d.iteritems()]
[[1403603400, 0.0], [1403603700, 3.0], [1403603100, 2.0], [1403604300, 0.0], [1403604000, 0.0]]

That approach requires the data sets to be mixed in the first place, as you had in the first version of your example input.

Upvotes: 1

Moj
Moj

Reputation: 6361

use numpy array and you don't need to do any looping.this make your code faster if you are dealing with larger datasets.

import numpy as np

In [68]: a = np.asarray(a)


In [69]: b = np.asarray(b)

In [71]: a_none_idx = np.equal(a,None)

In [72]: b_none_idx = np.equal(b,None)

In [73]: a[a_none_idx]=0

In [74]: b[b_none_idx]=0

In [76]: c = np.zeros(a.shape)

In [77]: c[:,0]= a[:,0] + b[:,0]

In [78]: c
Out[78]: 
array([[ 2.,  0.],
       [ 0.,  0.],
       [ 3.,  0.],
       [ 0.,  0.],
       [ 5.,  0.]])

In [79]: c[a_none_idx]=0

In [80]: c[b_none_idx]=0

In [81]: c[:,1] = a[:,1]

In [82]: c
Out[82]: 
array([[  2.00000000e+00,   1.40360310e+09],
       [  0.00000000e+00,   1.40360340e+09],
       [  3.00000000e+00,   1.40360370e+09],
       [  0.00000000e+00,   1.40360400e+09],
       [  0.00000000e+00,   1.40360430e+09]]

Upvotes: 0

Related Questions