Reputation: 122052
I want to check a set of sentences and see whether any seed words occur in them, but I want to avoid a plain substring test like "for seed in line", because that would report that the seed word "ring" appears in a doc containing the word "bring".
I also want to check whether multiword expressions (MWEs) like "word with spaces" appear in the document.
I've tried this, but it is uber slow. Is there a faster way of doing it?
seed = ['words with spaces', 'words', 'foo', 'bar',
        'bar bar', 'foo foo foo bar', 'ring']
docs = ['these are words with spaces but the drinks are the bar is also good',
        'another sentence at the foo bar is here',
        'then a bar bar black sheep',
        'but i dont want this sentence because there is just nothing that matches my list',
        "i forgot to bring my telephone but this sentence shouldn't be in the seeded docs too"]
docs_seed = []
for d in docs:
    toAdd = False
    for s in seed:
        # Multiword seeds: plain substring test.
        if " " in s:
            if s in d:
                toAdd = True
        # Single-word seeds: compare against the space-separated tokens.
        if s in d.split(" "):
            toAdd = True
        if toAdd:
            docs_seed.append((s, d))
            break
print(docs_seed)
The desired output should be this:
[('words with spaces', 'these are words with spaces but the drinks are the bar is also good'),
 ('foo', 'another sentence at the foo bar is here'),
 ('bar', 'then a bar bar black sheep')]
Upvotes: 1
Views: 914
Reputation: 179422
Consider using a regular expression:
import re
pattern = re.compile(r'\b(?:' + '|'.join(re.escape(s) for s in seed) + r')\b')
pattern.findall(line)
\b matches the start or end of a "word" (a sequence of word characters).
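A quick check of what \b does with the "ring"/"bring" case from the question. This is only an illustrative sketch; the name ring_pattern and the second test sentence are made up for the demonstration:

```python
import re

# Hypothetical single-seed pattern, just to show the effect of \b.
ring_pattern = re.compile(r'\bring\b')

# "ring" inside "bring" is preceded by a word character, so \b fails there.
inside_bring = ring_pattern.search('i forgot to bring my telephone')

# Here "ring" stands alone between spaces, so both \b anchors succeed.
standalone = ring_pattern.search('the ring is here')

print(inside_bring is None)   # True
print(standalone.group(0))    # ring
```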
Example:
>>> for line in docs:
...     print(pattern.findall(line))
...
['words with spaces', 'bar']
['foo', 'bar']
['bar', 'bar']
[]
[]
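To get the (seed, doc) pairs the question asks for, one option is to call search instead of findall and keep only the first hit per doc. The following is a minimal sketch under that assumption, reusing the seed and docs lists from the question:

```python
import re

seed = ['words with spaces', 'words', 'foo', 'bar',
        'bar bar', 'foo foo foo bar', 'ring']
docs = ['these are words with spaces but the drinks are the bar is also good',
        'another sentence at the foo bar is here',
        'then a bar bar black sheep',
        'but i dont want this sentence because there is just nothing that matches my list',
        "i forgot to bring my telephone but this sentence shouldn't be in the seeded docs too"]

# Alternatives are tried in list order at each position, so
# 'words with spaces' is attempted before the shorter 'words'.
pattern = re.compile(r'\b(?:' + '|'.join(re.escape(s) for s in seed) + r')\b')

docs_seed = []
for d in docs:
    m = pattern.search(d)  # leftmost whole-word match, or None
    if m:
        docs_seed.append((m.group(0), d))

print(docs_seed)
```

For these docs this yields the three pairs listed in the question. Note one difference from the question's loop: search reports the leftmost match in the doc, while the loop reports the first matching seed in list order, so the two can disagree on which seed is reported when a doc contains several seeds.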
Upvotes: 3
Reputation: 5149
This should work and be somewhat faster than your current approach:
docs_seed = []
for d in docs:
    for s in seed:
        pos = d.find(s)
        if pos == -1:
            continue
        end = pos + len(s)
        # Require a space or a string edge on both sides of the match.
        if (pos == 0 or d[pos - 1] == " ") and (end == len(d) or d[end] == " "):
            docs_seed.append((s, d))
            break
find gives us the position of the seed value in the doc (or -1 if it is not found). We then check that the characters before and after the match are spaces, or that the match sits at the very start or end of the string. This also fixes a bug in your original code, where multiword expressions did not need to start or end on a word boundary: your original code would match "words with spaces" for an input like "swords with spaces".
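One caveat with this approach: find only returns the first occurrence, so a doc such as "bring me the ring" would be rejected for the seed "ring" because the first hit sits inside "bring". A minimal sketch that keeps scanning after a rejected hit is shown below; the helper name find_whole is hypothetical and not part of the answer above:

```python
def find_whole(doc, s):
    """Return True if s occurs in doc bounded by spaces or string edges."""
    pos = doc.find(s)
    while pos != -1:
        end = pos + len(s)
        if (pos == 0 or doc[pos - 1] == " ") and (end == len(doc) or doc[end] == " "):
            return True
        # This hit was inside a longer word; look for the next one.
        pos = doc.find(s, pos + 1)
    return False


print(find_whole("bring me the ring", "ring"))                # True
print(find_whole("i forgot to bring my telephone", "ring"))   # False
print(find_whole("swords with spaces", "words with spaces"))  # False
```

Like the answer's loop, this treats only spaces as separators, so a seed followed by punctuation (for example "ring,") is not counted as a match.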
Upvotes: 0