user3119123

Reputation: 85

Remove strings containing words from list, without duplicate strings

I'm trying to get my code to extract sentences from a file that contain certain words. I have the code seen here below:

import re
f = open('RedCircle.txt', 'r')
text = ' '.join(f.readlines())
sentences = re.split(r' *[\.\?!][\'"\)\]]* *', text)

def finding(q):
    for item in sentences:
        if item.lower().find(q.lower()) != -1:
            list.append(item)

    for sentence in list:
        outfile.write(sentence+'\r\n')

finding('cats')
finding('apples')
finding('doggs')

But this will of course write the same sentence to the outfile three times if the sentence is:

'I saw doggs and cats eating apples' 

Is there a way to easily remove these duplicates, or make the code so that there will not be any duplicates in the file?

Upvotes: 1

Views: 136

Answers (2)

Abhijit

Reputation: 63707

There are a few options in Python that you can use to remove duplicate elements (in this case, I believe, sentences).

  1. Using Set.
  2. Using itertools.groupby
  3. OrderedDict as an OrderedSet, if Order is important
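The second option is the least obvious of the three, so here is a minimal sketch of it. It assumes the matches from all searches have already been collected into one list; the `matches` data below is hypothetical. Since `itertools.groupby` only merges adjacent equal items, the list is sorted first, which means the original order is lost.

```python
from itertools import groupby

# Hypothetical matches, as if collected from three separate searches.
matches = [
    'I saw doggs and cats eating apples',
    'Cats sleep a lot',
    'I saw doggs and cats eating apples',
    'I saw doggs and cats eating apples',
]

# groupby only groups consecutive equal items, so sort before grouping.
# Each group key is one distinct sentence.
unique = [key for key, _group in groupby(sorted(matches))]

for sentence in unique:
    print(sentence)
```

The result comes out in sorted order, so this fits only when the order of the sentences in the file does not matter.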

All you need to do is collect the results in a single list and use the links provided in this answer to create your own recipe for removing duplicates.

Also, instead of dumping the results to the file after each search, defer writing until all duplicates have been removed.

A Few Suggested Changes

Using Sets

  1. Convert Your function to a Generator

    def finding(q):
        return (item for item in sentences 
                if item.lower().find(q.lower()) != -1)
    
  2. Chain the result of each search

    from itertools import chain
    chain.from_iterable(finding(key) for key in ['cats', 'apples', 'doggs'])
    
  3. Pass the result to a Set

    set(chain.from_iterable(finding(key) for key in ['cats', 'apples', 'doggs']))
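Put together, the three steps might look like the sketch below. The `sentences` list is hypothetical sample data standing in for the text split from the file, and `io.StringIO` stands in for the real `outfile` so the example runs without any files. Writing happens once, after the set has removed the duplicates.

```python
import io
from itertools import chain

# Hypothetical sample data standing in for the sentences split from the file.
sentences = [
    'I saw doggs and cats eating apples',
    'The weather was fine',
    'Cats sleep a lot',
]

def finding(q):
    # Lazily yield every sentence that contains q, ignoring case.
    return (item for item in sentences if q.lower() in item.lower())

keys = ['cats', 'apples', 'doggs']

# Chain the three searches into one stream and let the set drop duplicates.
unique = set(chain.from_iterable(finding(key) for key in keys))

# Write only once, after deduplication. StringIO is a stand-in for the file.
outfile = io.StringIO()
for sentence in unique:
    outfile.write(sentence + '\r\n')
```

A set does not remember insertion order, so the sentences may be written in any order.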
    

Using Decorators

def uniq(fn):
    uniq_elems = set()
    def handler(*args, **kwargs):
        uniq_elems.update(fn(*args, **kwargs))
        return uniq_elems
    return handler
@uniq
def finding(q):
    return (item for item in sentences 
            if item.lower().find(q.lower()) != -1)
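A short usage sketch for the decorator, with a hypothetical `sentences` list as sample data. One thing worth noticing: the set lives in the closure, so every call adds to the same set and returns that same object. The value returned by the last call therefore holds the matches from all calls so far.

```python
# Hypothetical sample data standing in for the sentences split from the file.
sentences = [
    'I saw doggs and cats eating apples',
    'The weather was fine',
    'Cats sleep a lot',
]

def uniq(fn):
    uniq_elems = set()  # shared by every call to the decorated function
    def handler(*args, **kwargs):
        uniq_elems.update(fn(*args, **kwargs))
        return uniq_elems
    return handler

@uniq
def finding(q):
    return (item for item in sentences if q.lower() in item.lower())

# Each call adds its matches to the shared set.
for word in ['cats', 'apples', 'doggs']:
    result = finding(word)

for sentence in result:
    print(sentence)
```

Because the set is never reset, this suits a script that runs one batch of searches. Running a second, unrelated batch would need a freshly decorated function.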

If Order is Important

Change the decorator to use OrderedDict

from collections import OrderedDict

def uniq(fn):
    uniq_elems = OrderedDict()
    def handler(*args, **kwargs):
        uniq_elems.update(uniq_elems.fromkeys(fn(*args, **kwargs)))
        return uniq_elems.keys()
    return handler
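A usage sketch for the ordered variant, again with hypothetical sample data. The order it preserves is the order in which sentences are first found across the searches, which is not necessarily their order in the file. The keys are wrapped in `list()` here so the result is a plain list under Python 3; that wrapping is an addition for illustration.

```python
from collections import OrderedDict

# Hypothetical sample data standing in for the sentences split from the file.
sentences = [
    'Cats sleep a lot',
    'I saw doggs and cats eating apples',
    'The weather was fine',
    'Apples are red',
]

def uniq(fn):
    uniq_elems = OrderedDict()  # keys act as an ordered set
    def handler(*args, **kwargs):
        uniq_elems.update(OrderedDict.fromkeys(fn(*args, **kwargs)))
        return list(uniq_elems.keys())
    return handler

@uniq
def finding(q):
    return (item for item in sentences if q.lower() in item.lower())

for word in ['apples', 'cats', 'doggs']:
    result = finding(word)

for sentence in result:
    print(sentence)
```

Searching for 'apples' first puts both apple sentences ahead of 'Cats sleep a lot', even though that sentence comes first in the sample data.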

Note

  • Avoid variable names that shadow Python built-ins (such as naming a variable list)

Upvotes: 2

Josiah

Reputation: 1364

Firstly, does the order matter? Second, should duplicates appear if they're actually duplicated in the original text file?

If no to the first and yes to the second: rewrite the function to take a list of search strings and, for each sentence, check it against each of the words you're after. You can then break out of the inner loop as soon as one word matches, so each sentence is added at most once.
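A minimal sketch of that idea follows. The function name `finding_any` and the sample data are hypothetical. Each sentence is visited once and added at most once, so a sentence that matches several words is not repeated, while a sentence that really occurs twice in the text is kept twice.

```python
def finding_any(words, sentences):
    """Return each sentence that contains at least one of the words."""
    lowered = [w.lower() for w in words]
    result = []
    for sentence in sentences:
        text = sentence.lower()
        for word in lowered:
            if word in text:
                result.append(sentence)
                break  # one match is enough; move on to the next sentence
    return result

# Hypothetical sample data; the last sentence is a genuine duplicate.
sentences = [
    'I saw doggs and cats eating apples',
    'The weather was fine',
    'Cats sleep a lot',
    'Cats sleep a lot',
]

found = finding_any(['cats', 'apples', 'doggs'], sentences)
for sentence in found:
    print(sentence)
```

Since the sentences are scanned in file order, the result also happens to keep that order.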

If yes to the first and yes to the second: before adding an item to the list, check whether it's already there. Specifically, keep a note of which list items you've already passed in the original text file and which one you'll see next. That way you don't have to check the whole list, only a single item.

A set, as Abhijit suggests, would work if you answer no to both questions, since a set neither preserves order nor keeps genuine duplicates.

Upvotes: 0
