Reputation: 57
I want to print every line of the input file in which a string occurs, along with its line number. So far I have written the code shown below. It works, but not the way I want:
def index(filepath, keyword):
    with open(filepath) as f:
        for lineno, line in enumerate(f, start=1):
            matches = [k for k in keyword if k in line]
            if matches:
                result = "{:<15} {}".format(','.join(matches), lineno)
                print(result)
                print(line)

index('deneme.txt', ['elma'])
Output is as follows:
elma            15
Sogan+Noun ,+Punc domates+Noun ,+Punc patates+Noun ,+Punc elma+Noun ve+Conj turunçgil+Noun+A3pl ihracat+Noun+P3sg+Dat devlet+Noun destek+Noun+P3sg ver+Verb+Pass+Prog2+Cop .+Punc
So far so good. But when I enter a keyword like "Sog", it also finds Sogan. I don't want that; I only want to match whole tokens between whitespace. I think I need a regex for this, and I have one, but I don't know how to add it to this code:
r'[\w+]+'
Upvotes: 0
Views: 407
Reputation: 15513
Question: a keyword like "Sog" it also finds the Sogan ... I only want tokens between whitespaces. ... how can i add that regex to this code.
Build a regex with your keywords, using the or separator | for multiple keywords.
For example:
import re

def index(lines, keyword):
    # Raw string so that \+ and \s reach the regex engine unchanged.
    rc = re.compile(r".*?(({})\+.+?\s)".format(keyword))
    for i, line in enumerate(lines):
        match = rc.match(line)
        if match:
            print("lines[{}] match:{}\n{}".format(i, match.groups(), line))

if __name__ == "__main__":
    lines = [
        'Sogan+Noun ,+Punc domates+Noun ,+Punc patates+Noun ,+Punc elmaro+Noun ve+Conj ... (omitted for brevity)',
        'Sog+Noun ,+Punc domates+Noun ,+Punc patates+Noun ,+Punc elma+Noun ve+Conj ... (omitted for brevity)',
    ]
    index(lines, 'elma')
    index(lines, 'Sog|elma')
Output:
lines[1] match:('elma+Noun ', 'elma')
Sog+Noun ,+Punc domates+Noun ,+Punc patates+Noun ,+Punc elma+Noun ve+Conj ... (omitted for brevity)
lines[1] match:('Sog+Noun ', 'Sog')
Sog+Noun ,+Punc domates+Noun ,+Punc patates+Noun ,+Punc elma+Noun ve+Conj ... (omitted for brevity)
Tested with Python: 3.5
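The pattern above requires a + after the keyword but does not check what comes before it, so a token such as kelma+Noun would still match elma. A minimal sketch of one way to anchor the keyword at the start of a whitespace-delimited token follows. The list-of-keywords signature, the (?<!\S) lookbehind, and the sample lines are assumptions made for illustration, not part of the code above.

```python
import re

def index(lines, keywords):
    # Hypothetical variant: keywords is a list of literal strings.
    alternation = "|".join(re.escape(k) for k in keywords)
    # (?<!\S) asserts the keyword starts a token (start of line or after
    # whitespace); \+ requires the tag separator right after the keyword.
    rc = re.compile(r"(?<!\S)({})\+".format(alternation))
    hits = []
    for i, line in enumerate(lines):
        found = [m.group(1) for m in rc.finditer(line)]
        if found:
            hits.append((i, found))
    return hits

lines = [
    'Sogan+Noun ,+Punc elmaro+Noun ve+Conj',
    'Sog+Noun ,+Punc kelma+Noun elma+Noun ve+Conj',
]
print(index(lines, ['Sog', 'elma']))  # [(1, ['Sog', 'elma'])]
```

Using finditer instead of match also reports every keyword found in a line rather than only the first one.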
Upvotes: 1
Reputation: 114320
You will probably want to use the word boundary marker \b. This is an empty match for transitions between \w and \W. If you want your keywords to be literal strings, you will have to escape them first. You can combine everything into one regular expression using |:
pattern = re.compile(r'\b(' + '|'.join(map(re.escape, keyword)) + r')\b')
OR
pattern = re.compile(r'\b(' + '|'.join(re.escape(k) for k in keyword) + r')\b')
Computing the matches is a bit easier now, since you can use finditer instead of making your own comprehension:
matches = pattern.finditer(line)
Since each match is enclosed in a group, printing is not much more difficult:
result = "{:<15} {}".format(','.join(m.group() for m in matches), lineno)
OR
result = "{:<15} {}".format(','.join(map(re.Match.group, matches)), lineno)
Of course, don't forget to
import re
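Put together, the pieces might look like the sketch below. To keep it runnable without deneme.txt, it assumes the function receives any iterable of lines (an open file object would also work) and returns the formatted rows instead of printing them. Both choices, and the sample data, are assumptions rather than part of the question's code.

```python
import re

def index(lines, keyword):
    # Escape each keyword so it is matched literally, then join with |.
    pattern = re.compile(r'\b(' + '|'.join(map(re.escape, keyword)) + r')\b')
    results = []
    for lineno, line in enumerate(lines, start=1):
        matches = [m.group() for m in pattern.finditer(line)]
        if matches:
            results.append("{:<15} {}".format(','.join(matches), lineno))
    return results

sample = [
    'Sogan+Noun ,+Punc elma+Noun ve+Conj',
    'Sog+Noun ,+Punc domates+Noun',
]
for row in index(sample, ['Sog', 'elma']):
    print(row)
```

Here Sog no longer matches Sogan, because there is no word boundary between g and a. Note that + counts as a non-word character, so a keyword such as Noun would also match the tags.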
Corner Case
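If some of your keywords are prefixes of others, consider putting the longer ones first. For example, if you have
keyword = ['foo', 'foobar']
The regex will be
\b(foo|foobar)\b
When the engine encounters foobar in a line, it tries foo first, which matches but then fails against the trailing \b. Because | tries alternatives from left to right, the engine then backtracks and tries foobar, so the pattern above still finds the whole word. The order starts to matter if you drop the trailing \b: foo would then win and bar would be left unmatched. Pre-sorting the keywords by decreasing length before constructing the expression avoids both the backtracking and that pitfall: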
keyword.sort(key=len, reverse=True)
Or, if non-list inputs are possible:
keyword = sorted(keyword, key=len, reverse=True)
If you don't like this order, you can always print them in some other order after you match.
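A quick way to see how alternation order interacts with the trailing \b is to run the three variants side by side. This is a small check using only Python's re module; the variable names are made up for illustration.

```python
import re

text = 'foobar'

# With the trailing \b, the engine backtracks from 'foo' to 'foobar'.
bounded = re.compile(r'\b(foo|foobar)\b')
print(bounded.search(text).group())        # foobar

# Without the trailing \b, the first alternative that matches wins.
unbounded = re.compile(r'\b(foo|foobar)')
print(unbounded.search(text).group())      # foo

# Sorting by decreasing length puts 'foobar' first.
keyword = sorted(['foo', 'foobar'], key=len, reverse=True)
longest_first = re.compile(r'\b(' + '|'.join(map(re.escape, keyword)) + r')')
print(longest_first.search(text).group())  # foobar
```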
Upvotes: 1
Reputation: 61910
You could use the following regex:
import re

lines = [
    'Sogan+Noun ,+Punc domates+Noun ,+Punc patates+Noun ,+Punc elma+Noun ve+Conj turunçgil+Noun+A3pl ihracat+Noun+P3sg+Dat devlet+Noun destek+Noun+P3sg ver+Verb+Pass+Prog2+Cop .+Punc',
    'Sog+Noun ,+Punc domates+Noun ,+Punc patates+Noun ,+Punc elma+Noun ve+Conj turunçgil+Noun+A3pl ihracat+Noun+P3sg+Dat devlet+Noun destek+Noun+P3sg ver+Verb+Pass+Prog2+Cop .+Punc',
]
keywords = ['Sog']
pattern = re.compile(r'(\w+)\+')  # raw string: a word followed by a literal +
for lineno, line in enumerate(lines):
    words = set(m.group(1) for m in pattern.finditer(line))  # convert to set for efficiency
    matches = [keyword for keyword in keywords if keyword in words]
    if matches:
        result = "{:<15} {}".format(','.join(matches), lineno)
        print(result)
        print(line)
Output
Sog             1
Sog+Noun ,+Punc domates+Noun ,+Punc patates+Noun ,+Punc elma+Noun ve+Conj turunçgil+Noun+A3pl ihracat+Noun+P3sg+Dat devlet+Noun destek+Noun+P3sg ver+Verb+Pass+Prog2+Cop .+Punc
Explanation
The pattern '(\w+)\+' matches any run of word characters followed by a + character. Since + is a special character in a regex, you need to escape it to match it literally. Then use group(1) to extract the captured group (i.e. the run of word characters).
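To see what the pattern actually collects, it can be run on a shortened line. The sample line below is a trimmed version of the data above, chosen only for illustration.

```python
import re

pattern = re.compile(r'(\w+)\+')
line = 'Sog+Noun ,+Punc ihracat+Noun+P3sg+Dat .+Punc'

# Every run of word characters that is directly followed by a +.
words = set(m.group(1) for m in pattern.finditer(line))
print(sorted(words))  # ['Noun', 'P3sg', 'Sog', 'ihracat']
```

Tags that are themselves followed by another + (Noun and P3sg here) end up in the set as well, so a keyword such as Noun would also be reported.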
Upvotes: 1