Mimi Chung

Reputation: 97

Assign same index number to duplicated values up to n duplicates

I have the following column where values are duplicated any number of times:

FRUIT
Apples
Bananas
Bananas
Pear
Pear
Pear
Pear
Melon
Melon
Melon
Melon
Melon
Melon
Orange
Orange
Orange
Orange
Orange
Orange
Orange
Orange
Orange

I would like to assign an index number to each value. For duplicated values, however, I want the same index to repeat at most 4 times. If a value appears 10 times, the first four rows should get the index, the next four should get index + 1, and so on. For example:

Index    FRUIT
1        Apples
2        Bananas
2        Bananas
3        Pear
3        Pear
3        Pear
3        Pear
4        Melon
4        Melon
4        Melon
4        Melon
5        Melon
5        Melon
6        Orange
6        Orange
6        Orange
6        Orange
7        Orange
7        Orange
7        Orange
7        Orange
8        Orange

Here is my attempt:

import pandas as pd

fruit = {'FRUIT':['Apples','Bananas','Bananas','Pear','Pear','Pear','Pear','Melon','Melon','Melon','Melon','Melon','Melon','Orange','Orange','Orange','Orange','Orange','Orange','Orange','Orange','Orange']}
fruit_df = pd.DataFrame(fruit)

index = 0
index_and_fruit = []
for (columnName, columnData) in fruit_df.iteritems():
    fruit_list = fruit_df['FRUIT'].tolist()
    index = index + 1
    for i in fruit_list:
        if fruit_list.count(i) >= 4:
            index = index + 1
            index_with_fruit_list = {i:index}
            index_and_fruit.append(index_with_fruit_list)
            if fruit_list.count(i) >= 8:
                index_with_fruit_list = {i:index}
                index_and_fruit.append(index_with_fruit_list)
        else: 
            index_with_fruit_list = {i:index}
            index_and_fruit.append(index_with_fruit_list)
            print(index_and_fruit)

Upvotes: 1

Views: 418

Answers (2)

Alain T.

Reputation: 42139

You can use accumulate to form groups by computing the relative index of each fruit within its group. This lets you set a maximum on the group size and reset the relative index either when the fruit changes or when the maximum is reached.

With this grouping you can then assign sequential indexes based on the first item of each group (using accumulate again):

fruits = ['Apples','Bananas','Bananas','Pear','Pear','Pear','Pear','Melon','Melon',
          'Melon','Melon','Melon','Melon','Orange','Orange','Orange','Orange','Orange',
          'Orange','Orange','Orange','Orange']

from itertools import accumulate

maxGroup = 4
indexes  = range(len(fruits))
byGroup  = accumulate(indexes,lambda i,f: (i+1)*(f>0 and i<maxGroup-1 and fruits[f-1]==fruits[f]))
indexes  = [i-1 for i in accumulate(int(g==0) for g in byGroup)]
indexAndFruit = [(i,f) for i,f in zip(indexes,fruits)]

output:

for i,f in indexAndFruit: print(i,f)

0 Apples
1 Bananas
1 Bananas
2 Pear
2 Pear
2 Pear
2 Pear
3 Melon
3 Melon
3 Melon
3 Melon
4 Melon
4 Melon
5 Orange
5 Orange
5 Orange
5 Orange
6 Orange
6 Orange
6 Orange
6 Orange
7 Orange

To illustrate how this works, let's look at what the byGroup iterator will produce:

[0, 0, 1, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 0, 1, 2, 3, 0, 1, 2, 3, 0]

Each position where the indexing restarts at zero corresponds to a change of fruit or to the relative index reaching the maximum.

The zeroes in this list correspond to the starts of groups. Flagging them as 1s, with 0s for all other positions, gives the following result:

[1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1]

If we run a cumulative sum of those initial positions, we get (one-based) indexes:

[1, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 6, 6, 6, 6, 7, 7, 7, 7, 8]

Subtracting 1 gives us the desired (zero-based) indexes, which we only need to combine with the fruits (using zip):

[0, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 5, 5, 5, 5, 6, 6, 6, 6, 7]
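The walkthrough above can also be reproduced one step at a time, which may be easier to follow than the single accumulate expression. This is only a sketch: the helper relative_positions and the variable names are made up for illustration, and it uses a plain loop in place of the lambda, assuming the same fruits list and a maximum group size of 4.

```python
from itertools import accumulate

fruits = (['Apples'] + ['Bananas'] * 2 + ['Pear'] * 4
          + ['Melon'] * 6 + ['Orange'] * 9)
max_group = 4

def relative_positions(items, max_group):
    """Position of each item within its group (hypothetical helper).

    The position restarts at 0 when the item changes or when the
    previous position already reached max_group - 1.
    """
    positions = []
    for k, item in enumerate(items):
        same_as_previous = k > 0 and item == items[k - 1]
        if same_as_previous and positions[-1] < max_group - 1:
            positions.append(positions[-1] + 1)
        else:
            positions.append(0)
    return positions

by_group = relative_positions(fruits, max_group)   # step 1: relative indexes
starts = [int(p == 0) for p in by_group]           # step 2: flag group starts
one_based = list(accumulate(starts))               # step 3: cumulative sum
zero_based = [i - 1 for i in one_based]            # step 4: subtract 1
index_and_fruit = list(zip(zero_based, fruits))    # step 5: pair with fruits
```

If the one-based numbering from the question is wanted, step 4 can simply be skipped and one_based zipped with the fruits instead.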

Upvotes: 1

Lorenz

Reputation: 1323

My take, assuming the fruits are ordered:

fruits = ['Apples', 'Bananas', 'Bananas', 'Pear', 'Pear', 'Pear',
    'Pear', 'Melon', 'Melon', 'Melon', 'Melon', 'Melon', 'Melon',
    'Orange', 'Orange', 'Orange', 'Orange', 'Orange', 'Orange',
    'Orange', 'Orange', 'Orange']

# The index for the next fruit.
current_index = 0

# The last fruit we've seen.
last_fruit = None

# The number of times we've assigned the current index to the last
# fruit already.
fruit_count = 0

for fruit in fruits:
    if fruit != last_fruit or fruit_count >= 4:
        # This is either
        #   (a) a new fruit, or
        #   (b) a repeated fruit to which we've assigned the current
        #       index four times already.
        # In both cases, we want to skip to the next index.
        current_index += 1
        fruit_count = 0

    last_fruit = fruit
    fruit_count += 1

    print(current_index, fruit, f"(fruit_count={fruit_count})")
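If the indexes need to be collected rather than printed, the same idea could be wrapped in a function. The following is a rough sketch, not part of the loop above: assign_indexes, max_repeats and start are hypothetical names, it swaps the manual counters for itertools.groupby, and like the loop it assumes equal fruits are adjacent.

```python
from itertools import groupby

def assign_indexes(items, max_repeats=4, start=1):
    """Pair each item with an index that repeats at most max_repeats
    times within a run of equal, adjacent items (hypothetical helper)."""
    pairs = []
    index = start - 1
    for value, run in groupby(items):
        for position, _ in enumerate(run):
            # A new index begins at the start of each run and again
            # after every max_repeats items of the same run.
            if position % max_repeats == 0:
                index += 1
            pairs.append((index, value))
    return pairs

fruits = (['Apples'] + ['Bananas'] * 2 + ['Pear'] * 4
          + ['Melon'] * 6 + ['Orange'] * 9)
indexed = assign_indexes(fruits)
```

With the DataFrame from the question, the result could presumably be attached as a column with something like fruit_df['Index'] = [i for i, _ in assign_indexes(fruit_df['FRUIT'].tolist())].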

Upvotes: 1
