M3NT0

Reputation: 53

Merging arrays based on duplicate values in another list in Python?

I've organized my data into 3 lists. The first one simply contains floating-point numbers, some of which are duplicates. The second and third lists contain 1D arrays of variable length.

The first list is sorted and all lists contain the same number of elements.

The overall format is this:

a = [1.0, 1.5, 1.5, 2, 2]
b = [arr([1 2 3 4 10]), arr([4 8 10 11 5 6 12]), arr([1 5 7]), arr([70 1 2]), arr([1])]
c = [arr([3 4 8]), arr([5 6 12]), arr([6 7 10 123 14]), arr([70 1 2]), arr([1 5 10 4])]

I'm trying to find a way to merge the arrays in lists b and c if their corresponding float number is the same in the list a. For the example above, the desired result would be:

a = [1.0, 1.5, 2]
b = [arr([1 2 3 4 10]), arr([4 8 10 11 5 6 12 1 5 7]), arr([70 1 2 1])]
c = [arr([3 4 8]), arr([5 6 12 6 7 10 123 14]), arr([70 1 2 1 5 10 4])]

How would I go about doing this? Does it have something to do with zip?

Upvotes: 2

Views: 1396

Answers (3)

Austin

Reputation: 26039

Since a is sorted, I would use itertools.groupby. Similar to @MadPhysicist's answer, but iterating over the zip of lists:

import numpy as np
from itertools import groupby

arr = np.array

a = [1.0, 1.5, 1.5, 2, 2]
b = [arr([1, 2, 3, 4, 10]), arr([4, 8, 10, 11, 5, 6, 12]), arr([1, 5, 7]), arr([70, 1, 2]), arr([1])]
c = [arr([3, 4, 8]), arr([5, 6, 12]), arr([6, 7, 10, 123, 14]), arr([70, 1, 2]), arr([1, 5, 10, 4])]

res_a, res_b, res_c = [], [], []
# group consecutive (a, b, c) triples that share the same value in a
for k, g in groupby(zip(a, b, c), key=lambda x: x[0]):
    g = list(g)  # consume the group iterator so it can be reused below
    res_a.append(k)
    res_b.append(np.concatenate([x[1] for x in g]))
    res_c.append(np.concatenate([x[2] for x in g]))

...which outputs res_a, res_b and res_c as:

[1.0, 1.5, 2]
[array([ 1,  2,  3,  4, 10]), array([ 4,  8, 10, 11,  5,  6, 12,  1,  5,  7]), array([70,  1,  2,  1])]
[array([3, 4, 8]), array([  5,   6,  12,   6,   7,  10, 123,  14]), array([70,  1,  2,  1,  5, 10,  4])]
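
Note that groupby only groups *consecutive* equal keys, which is why the sortedness of a matters here; a quick illustration:

from itertools import groupby

print([k for k, _ in groupby([1, 2, 1])])  # [1, 2, 1], not [1, 2]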

Alternatively, in case a is not sorted, you can use a defaultdict:

import numpy as np
from collections import defaultdict

arr = np.array

a = [1.0, 1.5, 1.5, 2, 2]
b = [arr([1, 2, 3, 4, 10]), arr([4, 8, 10, 11, 5, 6, 12]), arr([1, 5, 7]), arr([70, 1, 2]), arr([1])]
c = [arr([3, 4, 8]), arr([5, 6, 12]), arr([6, 7, 10, 123, 14]), arr([70, 1, 2]), arr([1, 5, 10, 4])]

res_a, res_b, res_c = [], [], []

# map each value in a to the list of its (b, c) array pairs
d = defaultdict(list)

for x, y, z in zip(a, b, c):
    d[x].append([y, z])

# concatenate the collected arrays for each distinct key
for k, v in d.items():
    res_a.append(k)
    res_b.append(np.concatenate([x[0] for x in v]))
    res_c.append(np.concatenate([x[1] for x in v]))
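
With this version res_a comes out in insertion order rather than sorted; if you want sorted output for an unsorted a, a small tweak, swapping the final loop for one over sorted(d.items()), would do it:

for k, v in sorted(d.items()):
    res_a.append(k)
    res_b.append(np.concatenate([x[0] for x in v]))
    res_c.append(np.concatenate([x[1] for x in v]))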

Upvotes: 3

vrnvorona

Reputation: 476

EDIT: the solutions from @Austin and @Mad Physicist above are better, so prefer those. Mine reinvents the wheel, which is not the Pythonic way.

I think modifying the original arrays is dangerous, so even though this approach uses twice as much memory, it's safer to iterate and build new lists. What's happening:

  1. iterate over a and search for occurrences of the current value in the rest of a (we exclude the current index with remove(i))
  2. if there are no duplicates, just copy b and c over as usual
  3. if there are, merge them into temp lists, then append those to a1, b1 and c1. The value is recorded in merged_list so a later duplicate won't trigger another merge; the check at the top of the loop skips values that were already merged
  4. return the new lists (see the call after the code). I didn't bother with np arrays, though I used np.where since it is a bit faster than a list comprehension. Feel free to edit the data formats etc.; mine are kept simple for demonstration purposes.

import numpy as np

a = [1.0, 1.5, 1.5, 2, 2]
b = [[1, 2, 3, 4, 10], [4, 8, 10, 11, 5, 6, 12], [1, 5, 7], [70, 1, 2], [1]]
c = [[3, 4, 8], [5, 6, 12], [6, 7, 10, 123, 14], [70, 1, 2], [1, 5, 10, 4]]

def function(list1, list2, list3):
    a1 = []
    b1 = []
    c1 = []
    merged_list = []
    arr1 = np.array(list1)
    # to preserve the original index we use enumerate
    for i, item in enumerate(list1):
        # to avoid merging twice we skip values from a we already handled
        if item not in merged_list:
            ixs = np.where(arr1 == item)[0].tolist()
            ixs.remove(i)  # remove our original index, keeping only the duplicates
            # if empty, append to the new lists as usual since we don't need a merge
            if not ixs:
                a1.append(item)
                b1.append(list2[i])
                c1.append(list3[i])
                merged_list.append(item)
            else:
                temp1 = [*list2[i]]  # temp b and c prefilled with the first b and c
                temp2 = [*list3[i]]
                for ix in ixs:
                    temp1.extend(list2[ix])
                    temp2.extend(list3[ix])
                a1.append(item)
                b1.append(temp1)
                c1.append(temp2)
                merged_list.append(item)
    print(a1)
    print(b1)
    print(c1)
    return a1, b1, c1
# example output
# [1.0, 1.5, 2]
# [[1, 2, 3, 4, 10], [4, 8, 10, 11, 5, 6, 12, 1, 5, 7], [70, 1, 2, 1]]
# [[3, 4, 8], [5, 6, 12, 6, 7, 10, 123, 14], [70, 1, 2, 1, 5, 10, 4]]
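
Calling it on the sample data prints the output above and returns the new lists:

a1, b1, c1 = function(a, b, c)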

Upvotes: 1

Mad Physicist

Reputation: 114330

Since a is sorted, you could use itertools.groupby on the range of indices in your list, keyed by a:

import numpy as np
from itertools import groupby

result_a = []
result_b = []
result_c = []

for k, group in groupby(range(len(a)), key=a.__getitem__):
    group = list(group)
    index = slice(group[0], group[-1] + 1)
    result_a.append(k)
    result_b.append(np.concatenate(b[index]))
    result_c.append(np.concatenate(c[index]))

group is an iterator, so you need to consume it to get the actual indices it represents. Each group contains all the indices that correspond to the same value in a.

slice(...) is what gets passed to list.__getitem__ any time there is a : in the indexing expression. index is equivalent to group[0]:group[-1] + 1. This slices out the portion of the list that corresponds to each key in a.
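
A quick illustration with a throwaway list (letters is just a demonstration name):

letters = ['a', 'b', 'c', 'd']
letters[slice(1, 3)]  # ['b', 'c'], same as letters[1:3]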

Finally, np.concatenate just merges your arrays together in batches.
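
As a quick sanity check of what np.concatenate does with a list of 1D arrays:

import numpy as np

np.concatenate([np.array([1, 2]), np.array([3, 4])])
# array([1, 2, 3, 4])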

If you wanted to do this without doing list(group), you could consume the iterator in other ways, without keeping the values around. For example, you could get groupby to do it for you:

from itertools import groupby

result_a = []
result_b = []
result_c = []

prev = None

for k, group in groupby(range(len(a)), key=a.__getitem__):
    index = next(group)
    result_a.append(k)
    if prev is not None:
        result_b.append(np.concatenate(b[prev:index]))
        result_c.append(np.concatenate(c[prev:index]))
    prev = index

if prev is not None:
    result_b.append(np.concatenate(b[prev:]))
    result_c.append(np.concatenate(c[prev:]))

At that point, you wouldn't even really need to use groupby since it wouldn't be much more work to keep track of everything yourself:

result_a = []
result_b = []
result_c = []

k = None

for i, n in enumerate(a):
    if n == k:
        continue
    result_a.append(n)
    if k is not None:
        result_b.append(np.concatenate(b[prev:i]))
        result_c.append(np.concatenate(c[prev:i]))
    k = n
    prev = i

if k is not None:
    result_b.append(np.concatenate(b[prev:]))
    result_c.append(np.concatenate(c[prev:]))
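
Running any of the three variants on the question's data should reproduce the merged output, e.g.:

print(result_a)
# [1.0, 1.5, 2]
print(result_b)
# [array([ 1,  2,  3,  4, 10]), array([ 4,  8, 10, 11,  5,  6, 12,  1,  5,  7]), array([70,  1,  2,  1])]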

Upvotes: 1
