Reputation: 103

Remove duplicate words from text file input?

I was playing around with a function that would take 3 arguments: the name of a text file, substring1, and substring2. It would search through the text file and return the words that contained both of the substrings:

def myfunction(filename, substring1, substring2):
    result = ""
    text=open(filename).read().split()
    for word in text:
        if substring1 in word and substring2 in word:
            result+=word+" "
    return result

This function works, but I would like to remove the duplicate results. E.g., for my specific text file, if substring1 was "at" and substring2 was "wh", it would return "what"; however, because there are 3 "what"s in my text file, it returns all of them. I am looking for a way to return only unique words, with no duplicates. I would also like to keep the ORDER, so does that rule "sets" out?

I thought maybe doing something to "text" would work, somehow removing the duplicates before the loop.

Upvotes: 2

Answers (4)

Eric O. Lebigot

Reputation: 94575

Here is a solution that uses little memory (use of an iterator over the file lines) and has a good time complexity (which matters when the list of returned words is large, like in the case where substring1 is "a" and substring2 is "e", for English):

import collections

def find_words(file_path, substring1, substring2):
    """Return a string with the words from the given file that contain both substrings."""
    matching_words = collections.OrderedDict()
    with open(file_path) as text_file:
        for line in text_file:
            for word in line.split():
                if substring1 in word and substring2 in word:
                    matching_words[word] = True
    return " ".join(matching_words)

The OrderedDict preserves the order in which the keys are first used, so this keeps the words in the order in which they are found. Since it is a mapping, there are no duplicate words. The good time complexity is obtained thanks to the fact that inserting a key in an OrderedDict is done in constant time (as opposed to linear time for the if word in result_list of many of the other solutions).
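As a side note, on Python 3.7 and later a plain dict also preserves insertion order, so the same idea can be sketched without OrderedDict. This is only a minimal illustration, not part of the answer above; unique_in_order is a hypothetical helper name and the word list is made-up sample data.

```python
def unique_in_order(words):
    """Return the words with duplicates dropped, keeping first-seen order."""
    # dict.fromkeys keeps one key per distinct word; since Python 3.7,
    # dicts remember the order in which keys were first inserted.
    return list(dict.fromkeys(words))


# Made-up sample data standing in for the words read from a file.
sample_words = "what wheat that what whatever wheat".split()
matches = [w for w in sample_words if "wh" in w and "at" in w]
print(unique_in_order(matches))
```

Deduplicating after filtering keeps the dict small, since only matching words are ever stored.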

Upvotes: 2

zmo

Reputation: 24802

Please use the with statement to use the file's context manager. Using a list and testing for the presence of the string within the list will do the job for you:

def myfunction(filename, substring1, substring2):
    result = []
    with open(filename) as f:
        for word in f.read().split():
            if substring1 in word and substring2 in word:
                # only keep the first occurrence of each word
                if word not in result:
                    result.append(word)
    return result

And consider returning a list instead of a string, as you can always convert the list into a string easily whenever you need it, doing:

r = myfunction(arg1, arg2, arg3)
print(",".join(r))

edit:

@EOL is perfectly right, so here is a more time-efficient approach (though slightly less memory-efficient):

from collections import OrderedDict

def myfunction(filename, substring1, substring2):
    result = OrderedDict()
    with open(filename) as f:
        for word in f.read().split():
            if substring1 in word and substring2 in word:
                result[word] = None  # here we don't care about the stored value, only the key
    return list(result)  # the keys, in the order they were first inserted

The OrderedDict is a dictionary that preserves the order of insertion. A dictionary's keys behave like a set: each key is unique. So if a key is already in the dict, assigning to it a second time only overwrites the stored value and leaves the key in its original position. That operation is much faster than looking up a value in a list.
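To make the key behaviour concrete, here is a small sketch (the word list is made-up sample data): assigning to a key that already exists does not move it, so iterating over the dict yields each word once, in first-seen order.

```python
from collections import OrderedDict

seen = OrderedDict()
for word in ["what", "wheat", "what"]:
    # Assigning to an existing key replaces its value
    # but keeps the key in its original position.
    seen[word] = None

first_seen_order = list(seen)  # iterating a dict yields its keys
print(first_seen_order)
```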

Upvotes: 0

jonrsharpe

Reputation: 122096

I think the best way to do this, given that you want to keep the order, is to make result a list and check that each word isn't already in the list before you add it. Also, you should really use the with context manager to handle files, to ensure they get closed properly:

def myfunction(filename, substring1, substring2):
    result = []
    with open(filename) as f:
        text = f.read().split()
    for word in text:
        if substring1 in word and substring2 in word and word not in result:
            result.append(word)
    return " ".join(result)

Upvotes: 0

henrebotha

Reputation: 1298

Nah, all you need to do is make result a list instead of a string. Then, before adding each word, you can do if word not in result:. You can later convert the list into a space-separated string via ' '.join(result).

This will preserve the order in which they are found, whereas a set won't.
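If the list of matches gets long, the word not in result test scans the whole list each time. One common variant, sketched here under the assumption that the words are already split out of the file, keeps the list for order and adds a set purely for fast membership checks. The name matching_words is hypothetical.

```python
def matching_words(words, substring1, substring2):
    """Return the unique words containing both substrings, in first-seen order."""
    seen = set()   # fast membership test; its order is never relied on
    result = []    # preserves the order in which words are first found
    for word in words:
        if substring1 in word and substring2 in word and word not in seen:
            seen.add(word)
            result.append(word)
    return result


# Made-up sample text standing in for the contents of the file.
print(" ".join(matching_words("what that wheat what".split(), "wh", "at")))
```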

Upvotes: 0
