Reputation: 85
I'm trying to get my code to extract sentences from a file that contain certain words. I have the code shown below:
import re
f = open('RedCircle.txt', 'r')
text = ' '.join(f.readlines())
sentences = re.split(r' *[\.\?!][\'"\)\]]* *', text)
outfile = open('matches.txt', 'w')  # output file for the matching sentences

def finding(q):
    list = []  # note: this shadows the built-in list
    for item in sentences:
        if item.lower().find(q.lower()) != -1:
            list.append(item)
    for sentence in list:
        outfile.write(sentence + '\r\n')

finding('cats')
finding('apples')
finding('doggs')
But this will of course give me (in the outfile) the same sentence three times if the sentence is:
'I saw doggs and cats eating apples'
Is there a way to easily remove these duplicates, or make the code so that there will not be any duplicates in the file?
Upvotes: 1
Views: 136
Reputation: 63707
There are a few options in Python that you can leverage to remove duplicate elements (in this case, sentences).
All you need to do is collect the results in a single list and then use the suggestions below to create your own recipe for removing duplicates.
Also, instead of dumping the result to the file after each search, defer writing until all duplicates have been removed.
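For example, a minimal sketch of that flow, reusing the sentences list and outfile from your code, could look something like this:
results = []
for word in ['cats', 'apples', 'doggs']:
    for item in sentences:
        if word.lower() in item.lower():
            results.append(item)

# Deduplicate once, then write a single time.
for sentence in set(results):
    outfile.write(sentence + '\r\n')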
A Few Suggested Changes
Using Sets
Convert Your function to a Generator
def finding(q):
    return (item for item in sentences
            if item.lower().find(q.lower()) != -1)
Chain the result of each search
from itertools import chain

chain.from_iterable(finding(key) for key in ['cats', 'apples', 'doggs'])
Pass the result to a Set
set(chain.from_iterable(finding(key) for key in ['cats', 'apples', 'doggs']))
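You can then write the deduplicated sentences to the file in one pass (assuming outfile is open as in your code):
from itertools import chain

unique_sentences = set(
    chain.from_iterable(finding(key) for key in ['cats', 'apples', 'doggs'])
)
for sentence in unique_sentences:
    outfile.write(sentence + '\r\n')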
Using Decorators
def uniq(fn):
    uniq_elems = set()
    def handler(*args, **kwargs):
        uniq_elems.update(fn(*args, **kwargs))
        return uniq_elems
    return handler
@uniq
def finding(q):
    return (item for item in sentences
            if item.lower().find(q.lower()) != -1)
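With the decorator applied, every call adds its matches to the same set, so the value returned by the last call holds all the unique sentences. Roughly:
finding('cats')
finding('apples')
results = finding('doggs')  # the accumulated set of unique sentences

for sentence in results:
    outfile.write(sentence + '\r\n')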
If Order is Important
Change the Decorator to use OrderedDict
from collections import OrderedDict

def uniq(fn):
    uniq_elems = OrderedDict()
    def handler(*args, **kwargs):
        uniq_elems.update(uniq_elems.fromkeys(fn(*args, **kwargs)))
        return uniq_elems.keys()
    return handler
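On Python 3.7+ a plain dict also preserves insertion order, so the same idea works without the import:
def uniq(fn):
    uniq_elems = {}  # insertion-ordered on Python 3.7+
    def handler(*args, **kwargs):
        uniq_elems.update(dict.fromkeys(fn(*args, **kwargs)))
        return uniq_elems.keys()
    return handler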
Note: Avoid using list as a variable name, as your code does with list.append(item); it shadows the built-in list type.
Upvotes: 2
Reputation: 1364
First, does the order matter? Second, should duplicates appear if they're actually duplicated in the original text file?
If no to the first and yes to the second: rewrite the function to take a list of search strings and iterate over that (so it checks the current sentence against each of the words you're after), breaking out of the loop as soon as one of them matches. A sketch follows below.
If yes to the first and yes to the second: before adding an item to the list, check whether it's already there. Specifically, keep track of which sentences of the original text file you have already passed and which one you'll see next; that way you don't have to check the whole list, only a single item.
A set as Abhijit suggests would work if you answer no to the first question and yes to the second.
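A rough sketch of the first option (names are just for illustration):
def finding(words):
    matches = []
    for item in sentences:
        lowered = item.lower()
        for q in words:
            if q.lower() in lowered:
                matches.append(item)
                break  # one match is enough; move on to the next sentence
    return matches

for sentence in finding(['cats', 'apples', 'doggs']):
    outfile.write(sentence + '\r\n')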
Upvotes: 0