swenson

Reputation: 67

Python: unable to store function results in a variable

I wrote the following code to help me grab duplicate lines in a file and list out the line number of each duplicated line.

This code works when it is not in a function, but when I put it inside a function, as shown below, it does not behave as I expect.

I want the values returned by the "getallDups" function to be stored in the variable data.

#!/usr/bin/env python

filename = '/tmp/test.txt'
f = open(filename, "r")
contentAslist = f.read().splitlines()
def getallDups():
    lc = 0
    mystring = ""
    for eitem in contentAslist:
        lc += 1
        if contentAslist.count(eitem) > 1:
            mystring = lc,eitem
            return(mystring)

data = getallDups()
print data

The above code only stores the first duplicated line. It doesn't list all the duplicated lines.

How can this code be modified to do precisely what I want? How can it be modified to store the value returned by the function in the variable "data", which I can then work with?

Upvotes: 0

Views: 285

Answers (3)

Jeff Learman

Reputation: 3287

If you want it to return more results, it needs to calculate more results. Instead of returning the first match it finds, you need it to add that result to a list, and return the list:

contentAslist = [
    "abcd",
    "efgh",
    "abcd",
    "ijk",
    "lmno",
    "ijk",
    "lmno",
    "ijk",
]

def getallDups():
    lc = 0
    result = []
    for eitem in contentAslist:
        lc += 1
        if contentAslist.count(eitem) > 1:
            result.append((lc, eitem))
    return result

data = getallDups()
print data

However, this method is very inefficient, O(N^2), because the list.count() method is O(N) for N items in the list and we call it N times.

A better way is to use a hash (a dictionary keyed by the line text). Note that the return type here is very different, but it might be more useful, and it can easily be converted to your original form.

import collections
contentAslist = [
    "abcd",
    "efgh",
    "abcd",
    "ijk",
    "lmno",
    "ijk",
    "lmno",
    "ijk",
]
def getallDups():
    lc = 1
    # OrderedDict behaves like a plain dict ("{}"), except that iterating over it later follows the order in which keys were added.
    lhash = collections.OrderedDict()
    for line in contentAslist:
        # get list of line numbers matching this line, or empty list if it's the first
        line_numbers = lhash.get(line, [])
        # add this line number to the list
        line_numbers.append(lc)
        # Store the list of line numbers matching this line in the hash
        lhash[line] = line_numbers
        lc += 1

    return lhash

data = getallDups()

for line, line_numbers in data.iteritems():
    if len(line_numbers) > 1:
        print line, ":",
        for ln in line_numbers:
            print ln,
        print

The above solution is O(N), since each dictionary lookup and list append takes constant time on average.

Sample input:

abcd
efgh
abcd
ijk
lmno
ijk
lmno
ijk

Output:

abcd : 1 3
ijk : 4 6 8
lmno : 5 7
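
For the conversion back to the original form, here is a minimal sketch. It is written for Python 3 (print as a function), unlike the Python 2 code above, and the helper names group_line_numbers and to_pairs are hypothetical, made up for this illustration. It groups line numbers by line text in one pass, then flattens only the duplicated lines into (line number, line) pairs:

```python
from collections import defaultdict

def group_line_numbers(lines):
    """Map each line's text to the list of 1-based line numbers where it occurs."""
    groups = defaultdict(list)
    for number, line in enumerate(lines, start=1):
        groups[line].append(number)
    return groups

def to_pairs(groups):
    """Flatten the mapping into (line_number, line) pairs, duplicated lines only."""
    pairs = [
        (number, line)
        for line, numbers in groups.items()
        if len(numbers) > 1          # keep only lines seen more than once
        for number in numbers
    ]
    return sorted(pairs)             # order by line number

lines = ["abcd", "efgh", "abcd", "ijk", "lmno", "ijk", "lmno", "ijk"]
data = to_pairs(group_line_numbers(lines))
print(data)
# [(1, 'abcd'), (3, 'abcd'), (4, 'ijk'), (5, 'lmno'), (6, 'ijk'), (7, 'lmno'), (8, 'ijk')]
```

The grouping pass stays O(N); the final sort adds O(N log N), which can be dropped if ordering by line number does not matter.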

Upvotes: 1

Serge Ballesta

Reputation: 148910

You put a return statement in a loop inside a function: the return ends the function the first time it is reached, that is, at the first duplicate found. Possible fixes are to return a list (gathering the results in the loop) or to change the function into a generator.

Returning a list:

filename = '/tmp/test.txt'
with open(filename, "r") as f:              # the file is closed automatically
    contentAslist = f.read().splitlines()
def getallDups():
    mylist = []
    lc = 0
    for eitem in contentAslist:
        lc += 1
        if contentAslist.count(eitem) > 1:
            mylist.append((lc, eitem))      # append the duplicated line to a list
    return mylist                           # return the fully populated list

data = getallDups()
print data

Generator version:

filename = '/tmp/test.txt'
with open(filename, "r") as f:   # the file is closed automatically
    contentAslist = f.read().splitlines()
def getallDups():
    lc = 0
    for eitem in contentAslist:
        lc += 1
        if contentAslist.count(eitem) > 1:
            yield (lc, eitem)    # yield duplicate lines one at a time

data = list(getallDups())        # build a list from the generator values
print data

Upvotes: 1

g.d.d.c

Reputation: 47988

Your trouble here is that you're returning within a loop, which means that you never get the remainder of your data. You could fix that by simply swapping return for yield and changing your retrieval call to:

data = list(getallDups())

This will allow your loop to complete fully.
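
The list() call matters if you want to reuse the results: a generator can only be consumed once. The sketch below illustrates this. It is written for Python 3 (print as a function), and the name get_all_dups and the sample lines are assumptions made for the example, not part of the original code:

```python
def get_all_dups(lines):
    """Yield (line_number, line) for every line that occurs more than once."""
    for number, line in enumerate(lines, start=1):
        if lines.count(line) > 1:
            yield (number, line)

lines = ["abcd", "efgh", "abcd"]     # stand-in for the file's lines

gen = get_all_dups(lines)            # a generator object; nothing has run yet
first_pass = list(gen)               # runs the loop and consumes the generator
second_pass = list(gen)              # already exhausted, so this is empty

data = list(get_all_dups(lines))     # a real list can be indexed and reused
print(first_pass)    # [(1, 'abcd'), (3, 'abcd')]
print(second_pass)   # []
print(data[0])       # (1, 'abcd')
```

Storing the list rather than the generator gives you a value you can index, measure with len(), and loop over as many times as needed.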

Upvotes: 1
