Delete duplicated of an object within a list

Question

I got a waveform object, define as following:

class wfm:
    """Class defining a waveform characterized by:
        - A name
        - An electrode configuration
        - An amplitude (mA)
        - A pulse width (microseconds)"""

    def __init__(self, name, config, amp, width=300):
        self.name = name
        self.config = config
        self.amp = amp
        self.width = width

    def __eq__(self, other):
        return type(other) is self.__class__ and other.name == self.name and other.config == self.config and other.amp == self.amp and other.width == self.width

    def __ne__(self, other):
        return not self.__eq__(other)

Through parsing, I get a list called waveforms with 770 instance of wfm in it. There is a lot of duplicate, and I need to delete them.

My idea was to get the ID of equivalent object, store the largest ID in a list, and then loop on all the waveforms from the end while popping out each duplicate.

Code:

duplicate_ID = []
for i in range(len(waveforms)):
    for j in range(i+1, len(waveforms)):
        if waveforms[i] == waveforms[j]:
            duplicate_ID.append(waveforms.index(waveforms[j]))
            print ('{} eq {}'.format(i, j))

duplicate_ID = list(set(duplicate_ID)) # If I don't do that; 17k IDs

Turns out (thx to the print) that I have duplicates that don't appear into the ID list, for instance 750 is a duplicate of 763 (print says it; test too) and yet none of this 2 IDs appears in my duplicate list.

I'm quite sure there is a better solution that this method (which doesn't yet work), and I would be glad to hear it. Thanks for the help!

EDIT: More complicated scenario

I've got a more complicated scenario. I got 2 classes, wfm (see above) and stim:

class stim:
    """Class defining the waveform used for a stimultion by:
        - Duration (milliseconds)
        - Frequence Hz)
        - Pattern of waveforms"""

    def __init__(self, dur, f, pattern):
        self.duration = dur
        self.fq = f
        self.pattern = pattern

    def __eq__(self, other):
        return type(other) is self.__class__ and other.duration == self.duration and other.fq == self.fq and other.pattern == self.pattern

    def __ne__(self, other):
        return not self.__eq__(other)

I parse my files to fill a dict: paradigm. It looks like that:

paradigm[file name STR] = (list of waveforms, list of stimulations)

# example:
paradigm('myfile.xml') = ([wfm1, ..., wfm10], [stim1, ..., stim5])

Once again, I want to delete the duplicates, i.e. I want to keep only the data where:

Waveforms are the same
And stim is the same

Example:

file1 has 10 waveforms and file2 has the same 10 waveforms.
file1 has stim1 and stim2 ; file2 has stim3, sitm 4 and stim 5.

stim1 and stim3 are the same; so since the waveforms are also the same, I want to keep:
file1: 10 waveforms and stim1 and stim2
file2: 10 waveforms and stim 4 and stim5

The correlation is kinda messy in my head, so I got a few difficulties finding the right storage solution for wave forms and stimulation in order to compare them easly. If you got any idea, I'd be glad to hear it. Thanks!

Alex · Accepted Answer

Problem

The .index method uses the .__eq__ method you overloaded. So

waveforms.index(waveforms[j])

will always find the first instance of the waveform in the list containing the same attributes as waveforms[j].

w1 = wfm('a', {'test_param': 4}, 3, 2.0)
w2 = wfm('b', {'test_param': 4}, 3, 2.0)
w3 = wfm('a', {'test_param': 4}, 3, 2.0)

w1 == w3  # True
w2 == w3  # False

waveforms = [w1, w2, w3]
waveforms.index(waveforms[2]) == waveforms.index(waveforms[0]) == 0  # True

Solution

Immutable

You don't need to store list indices if you do this immutably:

key = lambda w: hash(str(vars(w)))
dupes = set()
unique = [dupes.add(key(w)) or w for w in waveforms if key(w) not in dupes]

unique == [w1, w2]  # True

Mutable

key = lambda w: hash(str(vars(w)))
seen = set()
idxs = [i if key(w) in seen else seen.add(key(w)) for i, w in enumerate(waveforms)]

for idx in filter(None, idxs[::-1]):
    waveforms.pop(idx)

waveforms == [w1, w2]  # True

Big O Analysis

It's a good habit to think about big O complexity when writing algorithms (though optimization should come at the cost of readability only when needed). In this case, these solutions are a bit more readable and also big O optimal.

Your initial solution was O(n^2) due to the double for loop.

Both solutions provided are O(n).