Reputation: 103
I was playing around with a function that takes three arguments: the name of a text file, substring1, and substring2. It searches the text file and returns the words that contain both substrings:
def myfunction(filename, substring1, substring2):
    result = ""
    text = open(filename).read().split()
    for word in text:
        if substring1 in word and substring2 in word:
            result += word + " "
    return result
This function works, but I would like to remove the duplicate results. For example, with my text file, if substring1 is "at" and substring2 is "wh", it returns "what". However, because there are 3 "what"s in my text file, it returns all of them. I am looking for a way to return only unique words, with no duplicates. I would also like to keep the ORDER, so does that rule out sets?
I thought maybe doing something to "text" would work, somehow removing the duplicates before the loop.
Upvotes: 2
Views: 3039
Reputation: 94575
Here is a solution that uses little memory (use of an iterator over the file lines) and has a good time complexity (which matters when the list of returned words is large, like in the case where substring1 is "a" and substring2 is "e", for English):
import collections

def find_words(file_path, substring1, substring2):
    """Return a string with the words from the given file that contain both substrings."""
    matching_words = collections.OrderedDict()
    with open(file_path) as text_file:
        for line in text_file:
            for word in line.split():
                if substring1 in word and substring2 in word:
                    matching_words[word] = True
    return " ".join(matching_words)
The OrderedDict preserves the order in which the keys are first used, so this keeps the words in the order in which they are found. Since it is a mapping, there are no duplicate words. The good time complexity comes from the fact that inserting a key in an OrderedDict takes constant time on average (as opposed to linear time for the if word in result_list test used in many of the other solutions).
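On Python 3.7 and later, a plain dict also preserves insertion order, so the same idea can be sketched without OrderedDict. This is only an illustration: the name unique_matching_words and its iterable-of-lines parameter are assumptions made here, not part of the answer above.

```python
def unique_matching_words(lines, substring1, substring2):
    """Return matching words in first-seen order, without duplicates.

    `lines` is any iterable of strings (an open text file works too).
    Hypothetical helper for illustration; it relies on dict insertion
    order, which the language guarantees from Python 3.7 on.
    """
    seen = {}
    for line in lines:
        for word in line.split():
            if substring1 in word and substring2 in word:
                # Assigning to an existing key keeps its original position.
                seen[word] = True
    return list(seen)


sample = ["what is that", "whatever what", "a hat"]
print(unique_matching_words(sample, "at", "wh"))
```

Passing an open file object as lines keeps the low memory use of the line-by-line approach.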
Upvotes: 2
Reputation: 24802
Please use the with statement to use the file's context manager. Using a list and testing for the presence of the string within the list will do the job for you:
def myfunction(filename, substring1, substring2):
    result = []
    with open(filename) as f:
        for word in f.read().split():
            if substring1 in word and substring2 in word:
                if word not in result:
                    result.append(word)
    return result
And consider returning a list instead of a string, as you can always convert the list into a string easily whenever you need it, doing:
r = myfunction(arg1, arg2, arg3)
print(",".join(r))
Edit:
@EOL is perfectly right, so here is a more time-efficient (but slightly less memory-efficient) approach:
from collections import OrderedDict

def myfunction(filename, substring1, substring2):
    result = OrderedDict()
    with open(filename) as f:
        for word in f.read().split():
            if substring1 in word and substring2 in word:
                result[word] = None  # here we don't care about the stored value, only the key
    return list(result)  # the words are the keys, not the values
The OrderedDict is a dictionary that preserves the order of insertion. A dictionary's keys behave like a set in that each key is unique. So if a key is already in the dict, inserting it a second time does not create a duplicate: the key keeps its original position and only its value is overwritten. That operation is much faster than looking up a value in a list.
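As a small illustration of that key-uniqueness property, an existing list of words can be de-duplicated in one step with OrderedDict.fromkeys. The sample list below is made up for the example:

```python
from collections import OrderedDict

# Hypothetical sample data: words that already matched both substrings.
words = ["what", "whatever", "what", "somewhat", "whatever"]

# fromkeys builds a mapping with one key per distinct word, in first-seen order.
unique_words = list(OrderedDict.fromkeys(words))
print(unique_words)
```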
Upvotes: 0
Reputation: 122096
I think the best way to do this, given that you want to keep the order, is to make result a list and check that each word isn't already in the list before you add it. Also, you should really use the context manager with to handle files, to ensure they get closed properly:
def myfunction(filename, substring1, substring2):
    result = []
    with open(filename) as f:
        text = f.read().split()
    for word in text:
        if substring1 in word and substring2 in word and word not in result:
            result.append(word)
    return " ".join(result)
Upvotes: 0
Reputation: 1298
Nah, all you need to do is make result a list instead of a string. Then, before adding each word, you can do if word not in result:. You can later convert the list into a space-separated string via ' '.join(result).
This will preserve the order in which they are found, whereas a set won't.
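A set on its own loses the order, but it can still sit next to the list to speed up the membership test. This is a common variation rather than what the answer above proposes, and the helper name unique_in_order is made up for this sketch:

```python
def unique_in_order(words):
    """Return the words without duplicates, in order of first appearance."""
    seen = set()   # fast membership tests, no ordering guarantee
    result = []    # keeps the order in which words were first seen
    for word in words:
        if word not in seen:
            seen.add(word)
            result.append(word)
    return result


print(" ".join(unique_in_order(["what", "whatever", "what"])))
```

The list alone is enough for small files. The extra set mostly pays off when many distinct words match.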
Upvotes: 0