Reputation: 153

find and update duplicates in a list of lists

I am looking for a Pythonic way to solve the following problem. I have what I think is a working solution, but it has complicated flow control and just isn't "pretty" (basically, it is a C++ solution).

I have a list of lists. Each list contains multiple items of varying types (maybe 10 items per list). The overall order of the lists is not relevant, but the order of the items within any individual list is important (i.e., I can't change it).

I am looking to "tag" duplicates by adding an extra field to the end of an individual list. However, in this case a "duplicate" list is one that has equal values in several preselected fields, but not all fields (there are no "true" duplicates).

For example, suppose this is the original data, a list of five lists, and a duplicate is defined as having equal values in the first and third fields:

['apple', 'window', 'pear', 2, 1.55, 'banana']
['apple', 'orange', 'kiwi', 3, 1.80, 'banana']
['apple', 'envelope', 'star_fruit', 2, 1.55, 'banana']
['apple', 'orange', 'pear', 2, 0.80, 'coffee_cup'] 
['apple', 'orange', 'pear', 2, 3.80, 'coffee_cup']

The first, fourth and fifth lists would be duplicates, and therefore all lists should be updated as follows (0 marks a list with no duplicate; duplicates are numbered from 1):

['apple', 'window', 'pear', 2, 1.55, 'banana', 1]
['apple', 'orange', 'kiwi', 3, 1.80, 'banana', 0]
['apple', 'envelope', 'star_fruit', 2, 1.55, 'banana', 0]
['apple', 'orange', 'pear', 2, 0.80, 'coffee_cup', 2]
['apple', 'orange', 'pear', 2, 3.80, 'coffee_cup', 3]

Thanks for any help or direction. I think this may be getting beyond the Learning Python book.

Upvotes: 2

Answers (3)

agf

Reputation: 176800

from collections import defaultdict

lists = [['apple', 'window', 'pear', 2, 1.55, 'banana'],
         ['apple', 'orange', 'kiwi', 3, 1.80, 'banana'],
         ['apple', 'envelope', 'star_fruit', 2, 1.55, 'banana'],
         ['apple', 'orange', 'pear', 2, 0.80, 'coffee_cup'],
         ['apple', 'orange', 'pear', 2, 3.80, 'coffee_cup']]

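dic = defaultdict(int)  # occurrences seen so far for each (first, third) key
fts = []                # keys that occur more than once
for lst in lists:
    first_third = lst[0], lst[2]
    dic[first_third] += 1
    if dic[first_third] == 2:
        fts.append(first_third)
    # Provisional tag: the running count for this key (1, 2, 3, ...)
    lst.append(dic[first_third])

# Keys that occurred only once are not duplicates: turn their tag 1 into 0
for lst in lists:
    if (lst[0], lst[2]) not in fts:
        lst[-1] -= 1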

print(lists)

Edit: Thanks utdemir. first_third = lst[0], lst[2] is correct, not first_third = lst[0] + lst[2]

Edit2: Changed variable names for clarity.

Edit3: Changed to reflect what the original poster really wanted, and his updated list. Not pretty any more, desired changes just tacked on.
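The same counting idea can be sketched as a reusable function that takes the key fields as a parameter. This is only an illustration, assuming the key fields are hashable; the names tag_duplicates and key_indexes are made up here and are not part of the code above.

```python
from collections import Counter

def tag_duplicates(rows, key_indexes):
    """Append 0 to rows with a unique key, and 1, 2, 3, ... to rows sharing a key."""
    # Build the comparison key for every row before anything is appended
    keys = [tuple(row[i] for i in key_indexes) for row in rows]
    totals = Counter(keys)  # how many rows share each key overall
    seen = Counter()        # running count per key while tagging
    for row, key in zip(rows, keys):
        if totals[key] == 1:
            row.append(0)
        else:
            seen[key] += 1
            row.append(seen[key])
    return rows

rows = [['apple', 'window', 'pear', 2, 1.55, 'banana'],
        ['apple', 'orange', 'kiwi', 3, 1.80, 'banana'],
        ['apple', 'envelope', 'star_fruit', 2, 1.55, 'banana'],
        ['apple', 'orange', 'pear', 2, 0.80, 'coffee_cup'],
        ['apple', 'orange', 'pear', 2, 3.80, 'coffee_cup']]
tag_duplicates(rows, (0, 2))
print([row[-1] for row in rows])  # prints [1, 0, 0, 2, 3]
```

Counting the totals up front avoids the second correction pass, and the rows stay in their original order.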

Upvotes: 3

Don O'Donnell

Reputation: 4728

Your best bet is to sort the list first, using itemgetter() to select the fields to be matched as the sort key. This brings all lists with matching key fields together so they can easily be compared and tagged. For example, the sort for matching the first and third fields would be:

lst.sort(key=itemgetter(0, 2))

The comparison of each item with its predecessor is straightforward.
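As a rough sketch of that neighbour comparison (the function name tag_adjacent_dups and the default 'dup' tag are illustrative assumptions): after sorting, a list is a duplicate if its key equals the key of the list just before or just after it.

```python
from operator import itemgetter

def tag_adjacent_dups(rows, key_indexes, tag="dup"):
    """Sort rows by the key fields, then tag rows whose key matches a neighbour."""
    keygetter = itemgetter(*key_indexes)
    ordered = sorted(rows, key=keygetter)
    keys = [keygetter(row) for row in ordered]
    for i, row in enumerate(ordered):
        same_as_prev = i > 0 and keys[i] == keys[i - 1]
        same_as_next = i + 1 < len(ordered) and keys[i] == keys[i + 1]
        if same_as_prev or same_as_next:
            row.append(tag)
    return ordered
```

Checking both neighbours is needed so that the first list of each run of equal keys is tagged too.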

Okay, here is the complete solution (uses itemgetter and groupby):

from operator import itemgetter
from itertools import groupby

def tagdups(input_seq, tag, key_indexes):
    keygetter = itemgetter(*key_indexes)
    sorted_list = sorted(input_seq, key=keygetter)
    for key, group in groupby(sorted_list, keygetter):
        group_list = list(group)
        if len(group_list) <= 1:
            continue
        for item in group_list:
            item.append(tag)
    return sorted_list

And here is a sample test run to show usage:

>>> samp = [[1,2,3,4,5], [1,3,5,7,7],[1,4,3,5,8],[4,3,2,7,5],[1,6,3,7,4]]
>>> tagdups(samp, 'dup', (0,2))
[[1, 2, 3, 4, 5, 'dup'], [1, 4, 3, 5, 8, 'dup'], [1, 6, 3, 7, 4, 'dup'], [1, 3, 5, 7, 7], [4, 3, 2, 7, 5]]
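Note that tagdups returns a new, sorted list but appends the tag to the original inner lists, so the input sequence keeps its original order and still ends up tagged. A quick check of that behavior, repeating the function and sample from above so the snippet runs on its own:

```python
from operator import itemgetter
from itertools import groupby

def tagdups(input_seq, tag, key_indexes):
    keygetter = itemgetter(*key_indexes)
    sorted_list = sorted(input_seq, key=keygetter)
    for key, group in groupby(sorted_list, keygetter):
        group_list = list(group)
        if len(group_list) <= 1:
            continue
        for item in group_list:
            item.append(tag)  # mutates the same inner lists that input_seq holds
    return sorted_list

samp = [[1, 2, 3, 4, 5], [1, 3, 5, 7, 7], [1, 4, 3, 5, 8], [4, 3, 2, 7, 5], [1, 6, 3, 7, 4]]
tagdups(samp, 'dup', (0, 2))
print(samp)
# prints [[1, 2, 3, 4, 5, 'dup'], [1, 3, 5, 7, 7], [1, 4, 3, 5, 8, 'dup'], [4, 3, 2, 7, 5], [1, 6, 3, 7, 4, 'dup']]
```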

Upvotes: 1

utdemir

Reputation: 27216

Here is my solution (commented code):

import itertools

l = [
        ['apple', 'window', 'pear', 2, 1.55, 'banana'],
        ['apple', 'orange', 'kiwi', 3, 1.80, 'banana'],
        ['apple', 'envelope', 'star_fruit', 2, 1.55, 'banana'],
        ['apple', 'orange', 'pear', 2, 0.80, 'coffee_cup'],
        ['apple', 'orange', 'pear', 2, 3.80, 'coffee_cup']
    ]

#Here you can select the important fields 
key = lambda i: (i[0],i[2])

l.sort(key=key)
grp = itertools.groupby(l, key=key)
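#Look at itertools documentation: groupby yields (key, iterator) pairs
#Build a list, not a generator, because grouped is iterated twice below
grouped = [list(j) for i, j in grp]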

for i in grouped:
    if len(i) == 1:
        i[0].append(0)
    else: #You want duplicates to start from 1
        for pos, item in enumerate(i, 1):
            item.append(pos)

#Just a little loop for flattening the list
result = [] 
for i in grouped:
    for j in i:
        result.append(j)

print(result)

Output:

[['apple', 'orange', 'kiwi', 3, 1.8, 'banana', 0],
 ['apple', 'window', 'pear', 2, 1.55, 'banana', 1],
 ['apple', 'orange', 'pear', 2, 0.8, 'coffee_cup', 2],
 ['apple', 'orange', 'pear', 2, 3.8, 'coffee_cup', 3],
 ['apple', 'envelope', 'star_fruit', 2, 1.55, 'banana', 0]]

Upvotes: 0
