homerMeng

Reputation: 71

Eliminating duplicated elements in a list

I was trying chapter 10.15 in the book Think Python and wrote the following code:

def turn_str_to_list(string):
    res = []
    for letter in string:
        res.append(letter)
    return res

def sort_and_unique (t):
    t.sort()
    for i in range (0, len(t)-2, 1):
        for j in range (i+1, len(t)-1, 1):
            if t[i]==t[j]:
                del t[j]
    return t

line=raw_input('>>>')
t=turn_str_to_list(line)
print t
print sort_and_unique(t)

I used a nested 'for' loop to eliminate any duplicated elements in a sorted list. However, when I ran it, I kept getting wrong output. If I input 'committee', the output is ['c', 'e', 'i', 'm', 'o', 't', 't'], which is wrong because it still contains a double 't'. I tried different inputs: sometimes the program fails to pick up duplicated letters in the middle of the list, and it never picks up the ones at the end. What am I missing? Thanks.

Upvotes: 3

Views: 171

Answers (6)

Jan Vlcinsky

Reputation: 44092

So you want an explanation of what is wrong in your code. Here it is:

Before we dive into coding, make test case(s)

It makes coding faster if we have test cases at hand from the very beginning.

For testing I will make a small utility function:

def textinout(text):
    return "".join(sort_and_unique(list(text)))

This allows quick test like:

>>> textinout("committee")
"ceimot"

and another helper function for readable error traces:

def checkit(textin, expected):
    msg = "For input '{textin}' we expect '{expected}', got '{result}'"
    result = textinout(textin)
    assert result == expected, msg.format(textin=textin, expected=expected, result=result)

And make the test case function:

def testit():
    checkit("abcd", 'abcd')
    checkit("aabbccdd", 'abcd')
    checkit("a", 'a')
    checkit("ddccbbaa", 'abcd')
    checkit("ddcbaa", 'abcd')
    checkit("committee", 'ceimot')

Let us run the first test with the existing function:

def sort_and_unique (t):
    t.sort()
    for i in range (0, len(t)-2, 1):
        for j in range (i+1, len(t)-1, 1):
            if t[i]==t[j]:
                del t[j]
    return t

Now we can test it:

testit()
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-11-291a15d81032> in <module>()
----> 1 testit()

<ipython-input-4-d8ad9abb3338> in testit()
      1 def testit():
      2         checkit("abcd", 'abcd')
----> 3         checkit("aabbccdd", 'abcd')
      4         checkit("a", 'a')
      5         checkit("ddccbbaa", 'abcd')

<ipython-input-10-620ac3b14f51> in checkit(textin, expected)
      2     msg = "For input '{textin}' we expect '{expected}', got '{result}'"
      3     result = textinout(textin)
----> 4     assert result == expected, msg.format(textin=textin, expected=expected, result=result)

AssertionError: For input 'aabbccdd' we expect 'abcd', got 'abcdd'

The last line of the error trace tells us what is wrong.

General comments to your code

Accessing list members via index

In most cases this is not efficient and it makes the code hard to read.

Instead of:

lst = ["a", "b", "c"]
for i in range(len(lst)):
    itm = lst[i]
    # do something with the itm

You should use:

lst = ["a", "b", "c"]
for itm in lst:
    # do something with the itm
    print itm

If you need to access a subset of a list, use slicing

Instead of:

for i in range (0, len(lst)-2, 1):
    itm = lst[i]

Use:

for itm in lst[:-2]:
    # do something with the itm
    print itm

If you really need to know the position of the processed item for inner loops, use enumerate:

Instead of:

lst = ["a", "b", "c", "d", "e"]
for i in range(0, len(lst)):
    for j in range (i+1, len(lst)-1, 1):
        itm_i = lst[i]
        itm_j = lst[j]
        # do something

Use enumerate, which turns each list item into a tuple (index, item):

lst = ["a", "b", "c", "d", "e"]
for i, itm_i in enumerate(lst):
    for itm_j in lst[i+1:-1]:
        print itm_i, itm_j
        # do something

Manipulating a list which is processed

You are looping over a list and, in the middle of the loop, deleting an item from it. Modifying a list during iteration is generally best avoided. If you have to do it, take extra care, for example by iterating backwards so that you never modify the part that is still to be processed in a later iteration.

As an alternative to deleting an item from the list being iterated, you can record your findings (such as duplicated items) in another list and act on them once you are out of the loop.
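To illustrate the backward-iteration option, here is a minimal sketch. The function name unique_sorted_in_place is made up for this example, and the code uses print() with a single argument so that it runs on both Python 2 and Python 3:

```python
def unique_sorted_in_place(lst):
    """Sort lst, then delete adjacent duplicates while walking from the end."""
    lst.sort()
    # Visit indices len-1 down to 1. Deleting lst[i] only shifts items at
    # positions greater than i, and those have already been visited.
    for i in range(len(lst) - 1, 0, -1):
        if lst[i] == lst[i - 1]:
            del lst[i]
    return lst


print(unique_sorted_in_place(list("committee")))  # ['c', 'e', 'i', 'm', 'o', 't']
```

Because the list is sorted, comparing each item with its left neighbour is enough, so no inner loop is needed.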

How your code could be rewritten

def sort_and_unique (lst):
    lst.sort()
    to_remove = []
    for i, itm_i in enumerate(lst[:-2]):
        for j, itm_j in enumerate(lst[i+1: -1]):
            if itm_i == itm_j:
                to_remove.append(itm_j)
    # now we are out of loop and can modify the lst
    # note, we loop over one list and modify another, this is safe
    for itm in to_remove:
        lst.remove(itm)
    return lst

Reading the code, the problem becomes apparent: you never touch the last item in the sorted list. That is why the duplicate "t" is not removed, as it is alphabetically the last item after sorting.

So your code could be corrected this way:

def sort_and_unique (lst):
    lst.sort()
    to_remove = []
    for i, itm_i in enumerate(lst[:-1]):
        for j, itm_j in enumerate(lst[i+1:]):
            if itm_i == itm_j:
                to_remove.append(itm_j)
    for itm in to_remove:
        lst.remove(itm)
    return lst

With this change the code passes all our test cases, which you can verify by calling testit()

>>> testit()

Silent test output is exactly what we were hoping for.

Having the test function makes further code modifications easy, as it is quick to check whether things still work as expected.

Anyway, the code can be shortened by getting the tuples (itm_i, itm_j) using zip:

def sort_and_unique (lst):
    lst.sort()
    to_remove = []
    for itm_i, itm_j in zip(lst[:-1], lst[1:]):
        if itm_i == itm_j:
            to_remove.append(itm_j)
    for itm in to_remove:
        lst.remove(itm)
    return lst

Test it:

>>> testit()

or using list comprehension:

def sort_and_unique (lst):
    lst.sort()
    to_remove = [itm_j for itm_i, itm_j in zip(lst[:-1], lst[1:]) if itm_i == itm_j]
    for itm in to_remove:
        lst.remove(itm)
    return lst

Test it:

>>> testit()

Because a list comprehension (using []) builds the complete list before any of its values are used, we can remove another line:

def sort_and_unique (lst):
    lst.sort()
    for itm in [itm_j for itm_i, itm_j in zip(lst[:-1], lst[1:]) if itm_i == itm_j]:
        lst.remove(itm)
    return lst

Test it:

>>> testit()

Note that so far the code still reflects your original algorithm; only two things were fixed:

- the list we are iterating over is no longer modified during the iteration
- the last item of the list is now taken into account as well
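One kind of input that the test cases above do not exercise is a letter that occurs three or more times. The sketch below, with the hypothetical names nested_version and neighbour_version, puts the nested-loop correction next to the zip-based version. The nested loops record one removal for every matching pair, so a triple letter gets removed three times, while comparing only direct neighbours records one removal per extra copy:

```python
def nested_version(lst):
    # Nested-loop variant: appends to to_remove once per matching pair.
    lst.sort()
    to_remove = []
    for i, itm_i in enumerate(lst[:-1]):
        for itm_j in lst[i + 1:]:
            if itm_i == itm_j:
                to_remove.append(itm_j)
    for itm in to_remove:
        lst.remove(itm)
    return lst


def neighbour_version(lst):
    # zip variant: compares each item only with its direct successor.
    lst.sort()
    to_remove = [b for a, b in zip(lst[:-1], lst[1:]) if a == b]
    for itm in to_remove:
        lst.remove(itm)
    return lst


print("".join(nested_version(list("abbbc"))))     # ac  (all three b's are gone)
print("".join(neighbour_version(list("abbbc"))))  # abc
```

Adding a line such as checkit("abbbc", 'abc') to testit() would catch this difference.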

Upvotes: 0

nitekrawler

Reputation: 425

Your program isn't removing all the duplicate letters because the use of del t[j] in the nested for-loops causes it to skip letters.

I added some prints to help illustrate this:

def sort_and_unique (t):
    t.sort()
    for i in range (0, len(t)-2, 1):
        print "i: %d" % i
        print t
        for j in range (i+1, len(t)-1, 1):
            print "\t%d %s len(t):%d" % (j, t[j], len(t))
            if t[i]==t[j]:
                print "\tdeleting %c" % t[j]
                del t[j]
    return t

Output:

>>>committee
['c', 'o', 'm', 'm', 'i', 't', 't', 'e', 'e']
i: 0
['c', 'e', 'e', 'i', 'm', 'm', 'o', 't', 't']
        1 e len(t):9
        2 e len(t):9
        3 i len(t):9
        4 m len(t):9
        5 m len(t):9
        6 o len(t):9
        7 t len(t):9
i: 1
['c', 'e', 'e', 'i', 'm', 'm', 'o', 't', 't']
        2 e len(t):9
        deleting e
        3 m len(t):8
        4 m len(t):8
        5 o len(t):8
        6 t len(t):8
        7 t len(t):8
i: 2
['c', 'e', 'i', 'm', 'm', 'o', 't', 't']
        3 m len(t):8
        4 m len(t):8
        5 o len(t):8
        6 t len(t):8
i: 3
['c', 'e', 'i', 'm', 'm', 'o', 't', 't']
        4 m len(t):8
        deleting m
        5 t len(t):7
        6 t len(t):7
i: 4
['c', 'e', 'i', 'm', 'o', 't', 't']
        5 t len(t):7
i: 5
['c', 'e', 'i', 'm', 'o', 't', 't']
i: 6
['c', 'e', 'i', 'm', 'o', 't', 't']
['c', 'e', 'i', 'm', 'o', 't', 't']

Whenever del t[j] is called, the list becomes one element shorter, but the inner for-loop keeps advancing j as if nothing had changed.

For example:

i=1, j=2, t = ['c', 'e', 'e', 'i', 'm', 'm', 'o', 't', 't']

It sees that t[1] == t[2] (both 'e') so it removes t[2].

Now t = ['c', 'e', 'i', 'm', 'm', 'o', 't', 't']

However, the code continues with i=1, j=3, which compares 'e' to 'm' and skips over 'i'.

Lastly, it does not catch the last two 't's because by the time i=5, len(t) is 7, so the inner for-loop iterates over range(6, 6, 1), which is empty, and its body is never executed.
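One possible way to keep the original nested-loop idea while avoiding both problems is to use while loops, so that len(t) is checked again on every pass and j only advances when nothing was deleted. This is a sketch only; the name sort_and_unique_while is made up for this example:

```python
def sort_and_unique_while(t):
    t.sort()
    i = 0
    while i < len(t) - 1:      # len(t) is re-evaluated on every pass
        j = i + 1
        while j < len(t):      # the last item is included this time
            if t[i] == t[j]:
                # After the deletion the next item slides into position j,
                # so j must stay where it is.
                del t[j]
            else:
                j += 1
        i += 1
    return t


print(sort_and_unique_while(list("committee")))  # ['c', 'e', 'i', 'm', 'o', 't']
```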

Upvotes: 4

hd1

Reputation: 34657

Here you go:

In [1]: word = 'committee'

In [3]: word_ = set(word)

In [4]: word_
Out[4]: {'c', 'e', 'i', 'm', 'o', 't'}

The standard way to get the unique elements in Python is to use a set. The set constructor takes any iterable whose items are hashable. Iterating over a string yields its characters one by one, so it qualifies. Note that a set is unordered, so apply sorted() to it if you need the letters in alphabetical order.
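A small sketch of what this means in practice (the variable name unique_letters is just for illustration): the number of elements and membership tests are reliable, but the order in which a set prints is not something to depend on, so the example sorts before printing.

```python
word = "committee"
unique_letters = set(word)     # iterating over a string yields its characters

print(len(unique_letters))     # 6
print("t" in unique_letters)   # True
print(sorted(unique_letters))  # ['c', 'e', 'i', 'm', 'o', 't']

# Any iterable works, for example a list that contains duplicates.
print(set(["b", "a", "b"]) == set("ab"))  # True
```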

If you have further problems, do leave a comment.

Upvotes: 1

Bharat

Reputation: 3000

In Python you can make use of the built-in data structures and functions such as set() and list().

Your turn_str_to_list() can be done with list(). Maybe you know this but wanted to do it on your own.

Using the list() and set() APIs:

line=raw_input('>>>')
print list(set(line))

Your sort_and_unique() has O(n^2) complexity. One way to make it cleaner:

def sort_and_unique2(t):
    t.sort()
    res = []
    for i in t:
        if i not in res:
            res.append(i)

    return res 

This is still O(n^2), since the lookup (i not in res) takes linear time, but the code looks a bit cleaner. Deleting from a list is O(n), so it is better to append to a new list instead, since append is O(1). See this page for the complexities of the list API: https://wiki.python.org/moin/TimeComplexity
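If the linear lookup is a concern, one option is to rely on the sorting: in a sorted sequence equal items sit next to each other, so it is enough to compare each item with the last one kept. A minimal sketch follows; the name sort_and_unique3 is made up for this example, and unlike the version above it uses sorted(), so the input list is left unmodified:

```python
def sort_and_unique3(t):
    res = []
    for item in sorted(t):
        # Equal items are adjacent after sorting, so checking the most
        # recently kept item is an O(1) replacement for "item not in res".
        if not res or res[-1] != item:
            res.append(item)
    return res


print(sort_and_unique3(list("committee")))  # ['c', 'e', 'i', 'm', 'o', 't']
```

The loop itself is a single pass, so the overall cost is dominated by the sort.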

Upvotes: 2

Jan Vlcinsky

Reputation: 44092

Solution explained:

>>> word = "committee"

Turn string to list of characters:

>>> clst = list(word)
>>> clst
['c', 'o', 'm', 'm', 'i', 't', 't', 'e', 'e']

Use set to get only unique items:

>>> unq_clst = set(clst)
>>> unq_clst
{'c', 'e', 'i', 'm', 'o', 't'}

It turns out (thanks Blckknght) that the list step is not necessary, and we can do it this way:

>>> unq_clst = set(word)
>>> unq_clst
{'c', 'e', 'i', 'm', 'o', 't'}

Both set and list take an iterable as their parameter, and iterating over a string returns one character after another.

Sort it:

>>> sorted(unq_clst)
['c', 'e', 'i', 'm', 'o', 't']

One line version:

>>> sorted(set("COMMITTEE"))
['C', 'E', 'I', 'M', 'O', 'T']

Upvotes: 1

bigpotato

Reputation: 219

You can try the following code snippet:

s = "committee"
res = sorted(set(s))

Upvotes: 1
