Marwan Lkd
Marwan Lkd

Reputation: 31

Algorithm : how to split list data into fixed size sub-lists without having separate numbers

I have an algorithmic question for you. No need for an application context, I'll give you a direct example.

Here is a possible input: input = [ 1, 1, 1, 2, 2, 2, 2, 3, 3, 4, 4, 4 ]. Let's assume a batch size of 5. The idea here is to output lists of maximum size 5 without having separate number, in short: 2 identical numbers cannot be in separate sub-lists. Example output: [ [1, 1, 1], [2, 2, 2, 2], [3, 3, 4, 4, 4] ]

Assumptions: Numbers always sorted, batch_size always larger than possible number of numbers

Do you have a more elegant solution than the one I just found?

i = 0
batch_size = 5
res = []
while i < len(input):
    # Retrieve the data list according to the batch size
    data = input[i: i + size]
    # Increment the index
    i += size
    # See what's the next output looks like
    future_data = input[i: i + size]
    if future_data and future_data[0] == data[-1]:
        # So we count how many times this number appears in our current list 
        # and subtract that from our index
        cp = data.count(data[-1])
        i -= cp
        # Then remove from the current list all occurrence of that number
        data = data[:-cp]
    res.append(data)

Edit: according to @juanpa.arrivillaga's answer:

Thank you all for your reactivity and your answers.

I continue on episode 2, I gave you here my simplified problem and I thought your solution would be sufficient, despite your response, I do not see how to adapt the solution of @juanpa.arrivillaga to my data format, in fact the input would look more like :

input = { 
    'data_1' : { 
        'id': [1, 1, 1, 2, 2, 2, 2, 3, 3, 4, 4, 4], 
        'char': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L'] 
    }
}

!the size of the lists in value of 'id' and 'char' are necessarily equal!

the output must look like:

[ 
    [1, 'A', 1, 'B', 1, 'C'], 
    [2, 'D', 2, 'E', 2, 'F', 2, 'G'], 
    [3, 'H', 3, 'I', 4, 'J', 4, 'K', 4, 'L'] 
]

I am aware that the data structure is not optimal, unfortunately I don't have the hand on it and is therefore unchangeable...

Still the same constrains as before (batch size is working only on the id, am I clear enough ?)

Upvotes: 3

Views: 523

Answers (3)

Rishabh Semwal
Rishabh Semwal

Reputation: 348

If you don't want to use any library(module), then this solution is for you.

batch_size = 5
id = list(map(int, input("Enter multiple ids in one line with space between them: ").split()))
char = list(map(str, input("Enter multiple char in one line with space between them: ").split()))
id_char_input = {char[i]: id[i] for i in range(len(id))}
id_unique = list(set([value for key, value in id_char_input.items()]))
group_input = [[] for _ in range(len(id_unique))]
output = []
j = 0

for key, value in id_char_input.items():
    group_input[id_unique.index(value)].append(value)
    group_input[id_unique.index(value)].append(key)

while j < len(group_input):
    if j != len(group_input) - 1:
        if len(group_input[j] + group_input[j + 1]) <= 2*batch_size:
            output.append(group_input[j] + group_input[j + 1])
            j += 2
        else:
            if group_input[j] not in output:
                output.append(group_input[j])
            j += 1
    else:
        output.append(group_input[j])
        break

print(output)
Output:
Enter multiple ids in one line with space between them: 1 1 1 2 2 2 2 3 3 4 4 4
Enter multiple char in one line with space between them: A B C D E F G H I J K L
[[1, 'A', 1, 'B', 1, 'C'], [2, 'D', 2, 'E', 2, 'F', 2, 'G'], [3, 'H', 3, 'I', 4, 'J', 4, 'K', 4, 'L']]

EDIT: Now you can take values of id and char from the input.

Upvotes: 0

Ajax1234
Ajax1234

Reputation: 71471

You can use itertools.groupby with a recursive generator function to find possible merges that meet your criteria. This way, you can better handle more ambiguous cases where it is unclear which "sibling" group should absorb the double pairing and/or truncated results where groups have a length greater than batch_size:

from itertools import groupby
data = {'data_1': {'id': [1, 1, 1, 2, 2, 2, 2, 3, 3, 4, 4, 4], 'char': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L']}}
batch_size = 5
def get_groups(d, c = [], p = []):
  if not d and not p and all((l:=len(i)) <= batch_size and i and l != 2 for i in c):
     #found valid combo
     yield c
  elif d:
     _p = (k:=(p+d[0]))[(b:=(l if (l:=len(k)) <= batch_size else -1*(l-batch_size))):]
     if l == 2 and not _p:
        #if group size is two, then we have to find possible merges for group
        yield from get_groups(d[1:], c=c, p = k)
        yield from get_groups([c[-1]+k]+d[1:], c=c[:-1], p = [])
     elif _p:
        #group size > batch_size, need to find possible siblings that can take subgroups of groups
        for i in range(batch_size):
           yield from get_groups(d[1:], c=c+[k[:i]], p = k[i:])
           if c and len(c[-1]) + i <= batch_size:
              yield from get_groups(d[1:], c=c[:-1]+[c[-1]+k[batch_size:batch_size+i]]+[k[:batch_size]], p = k[batch_size+i:])
     yield from get_groups(d[1:], c=c+[k[:b]], p = _p)
  elif p:
     yield from get_groups(d[1:], c=c+[p], p = [])

combo = next(get_groups([list(b) for _, b in groupby(data['data_1']['id'])]))
c_m = iter(data['data_1']['char'])
result = [[i for j in [(x, next(c_m)) for x in y] for i in j] for y in combo]

Output:

[[1, 'A', 1, 'B', 1, 'C'], [2, 'D', 2, 'E', 2, 'F', 2, 'G'], [3, 'H', 3, 'I', 4, 'J', 4, 'K', 4, 'L']]

Upvotes: 1

juanpa.arrivillaga
juanpa.arrivillaga

Reputation: 96360

Here is how I would do this, in a single pass:

>>> import itertools
>>> batch_size = 5
>>> result = [[]]
>>> input_data = [ 1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4 ]
>>> for _, g in itertools.groupby(input_data):
...     current = list(g)
...     if len(result[-1]) + len(current) <= batch_size:
...         result[-1].extend(current)
...     else:
...         result.append(current)
...
>>> result
[[1, 1, 1], [2, 2, 2, 3, 3], [4, 4, 4]]

Let's break it down into intermediate steps to aid in understanding, first, this is what itertools.groupby does eagerly evaluated:

>>> import itertools
>>> batch_size = 5
>>> grouped = [list(g) for k,g in itertools.groupby([ 1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4 ])]
>>> grouped
[[1, 1, 1], [2, 2, 2], [3, 3], [4, 4, 4]]

Then, simply build up your result, which is a list of sublists. If it can fit into the current sublist, add the group to the sublist, otherwise, append a new sublist consisting of that group (which we can assume is not larger than batch_size):

>>> result = [[]]
>>> for group in grouped:
...     if len(result[-1]) + len(group) <= batch_size:
...         result[-1].extend(group)
...     else:
...         result.append(group[:])
...
>>> result
[[1, 1, 1], [2, 2, 2, 3, 3], [4, 4, 4]]

The above does two passes on the data, the first example posted does one pass.

Note, if using itertools.groupby feels like "cheating", you can implement something that does the trick for this case relatively easy:

def simple_groupby(data):
    it = iter(data)
    empty = object()
    current = next(it, empty)
    if current is empty:
        return
    prev = current
    acc = [current]
    for current in it:
        if prev == current:
            acc.append(current)
        else:
            yield acc
            acc = [current]
        prev = current
    yield acc

Upvotes: 4

Related Questions