Reputation: 67
I wrote the following code to help me grab duplicate lines in a file and list out the line number of each duplicated line.
this code works when not in a function. But when I put the code inside a function as is shown below, it's not behaving like I'm expecting it to.
I want the values of the "getallDups" function to be stored in variable data.
#!/usr/bin/env python
filename = '/tmp/test.txt'
f = open(filename, "r")
contentAslist = f.read().splitlines()
def getallDups():
lc = 0
mystring = ""
for eitem in contentAslist:
lc += 1
if contentAslist.count(eitem) > 1:
mystring = lc,eitem
return(mystring)
data = getallDups()
print data
The above code only stores the first duplicated line. it doesn't list all the duplicated lines.
How can this code be modified to do precisely what I want? How can it be modified to store the value of the defined function in the variable "data", which I can then play with.
Upvotes: 0
Views: 285
Reputation: 3287
If you want it to return more results, it needs to calculate more results. Instead of returning the first match it finds, you need it to add that result to a list, and return the list:
contentAslist = [
"abcd",
"efgh",
"abcd",
"ijk",
"lmno",
"ijk",
"lmno",
"ijk",
]
def getallDups():
lc = 0
result = []
for eitem in contentAslist:
lc += 1
if contentAslist.count(eitem) > 1:
result.append((lc, eitem))
return result
data = getallDups()
print data
However, this is a very inefficient method, O(N^2), because list.count() method is O(N) for N items in the list, and we call it N times.
A better way is to use a hash. Note that the return type here is very different, but might be more useful, and can easily be converted to your original form.
import collections
contentAslist = [
"abcd",
"efgh",
"abcd",
"ijk",
"lmno",
"ijk",
"lmno",
"ijk",
]
def getallDups():
lc = 1
# OrderedDict is same as "{}" except that when we iterate them later they're in the order that we added them.
lhash = collections.OrderedDict()
for line in contentAslist:
# get list of line numbers matching this line, or empty list if it's the first
line_numbers = lhash.get(line, [])
# add this line number to the list
line_numbers.append(lc)
# Store the list of line numbers matching this line in the hash
lhash[line] = line_numbers
lc += 1
return lhash
data = getallDups()
for line, line_numbers in data.iteritems():
if len(line_numbers) > 1:
print line, ":",
for ln in line_numbers:
print ln,
print
The above solution is O(N).
Sample input:
abcd
efgh
abcd
ijk
lmno
ijk
lmno
ijk
Output:
abcd : 1 3
ijk : 4 6 8
lmno : 5 7
Upvotes: 1
Reputation: 148910
You put a return
statement in a loop inside a function: the return causes the function end at its first iteration... Possible ways are to return a list (and gather strings in the loop) or change the function to a generator.
Returning a list:
filename = '/tmp/test.txt'
f = open(filename, "r")
contentAslist = f.read().splitlines()
def getallDups():
mylist = []
lc = 0
for eitem in contentAslist:
lc += 1
if contentAslist.count(eitem) > 1:
mylist.append((lc, eitem)) # append the duplicated line to a list
return mylist # return the fully populated list
data = getallDups()
print data
Generator version:
filename = '/tmp/test.txt'
f = open(filename, "r")
contentAslist = f.read().splitlines()
def getallDups():
mylist = []
lc = 0
for eitem in contentAslist:
lc += 1
if contentAslist.count(eitem) > 1:
yield (lc, eitem) # yield duplicate lines one at a time
data = list(getallDups()) # build a list from the generator values
print data
Upvotes: 1
Reputation: 47988
Your trouble here is that you're returning within a loop, which means that you never get the remainder of your data. You could fix that by simply swapping return
for yield
and changing your retrieval call to:
data = list(getallDups())
This will allow your loop to complete fully.
Upvotes: 1