erkevarol

Reputation: 57

Python - All the lines and line numbers in which string occurs in the input file

I want to print all the lines in which a string occurs in the input file, along with their line numbers. So far I have written the code shown below. It works, but not in the way I want:

def index(filepath, keyword):

    with open(filepath) as f:
        for lineno, line in enumerate(f, start=1):
            matches = [k for k in keyword if k in line]
            if matches:
                result = "{:<15} {}".format(','.join(matches), lineno)
                print(result)
                print(line)

index('deneme.txt', ['elma'])

Output is as follows:

elma            15
Sogan+Noun ,+Punc domates+Noun ,+Punc patates+Noun ,+Punc elma+Noun ve+Conj turunçgil+Noun+A3pl ihracat+Noun+P3sg+Dat devlet+Noun destek+Noun+P3sg ver+Verb+Pass+Prog2+Cop .+Punc  

So far so good. But when I enter a keyword like "Sog", it also finds Sogan, which I don't want; I only want to check whole tokens between whitespace. I think I need a regex for this, and I have one, but I don't know how to add it to this code.

r'[\w+]+'

Upvotes: 0

Views: 407

Answers (3)

stovfl

Reputation: 15513

Question: a keyword like "Sog" it also finds the Sogan ... I only want tokens between whitespaces. ... how can i add that regex to this code.

Build a regex from your keywords, using the | (or) separator to combine multiple keywords.

For example:

import re

def index(lines, keyword):
    rc = re.compile(r".*?(({})\+.+?\s)".format(keyword))

    for i, line in enumerate(lines):
        match = rc.match(line)
        if match:
            print("lines[{}] match:{}\n{}".format(i, match.groups(), line))

if __name__ == "__main__":
    lines = [
        'Sogan+Noun ,+Punc domates+Noun ,+Punc patates+Noun ,+Punc elmaro+Noun ve+Conj ... (omitted for brevity)',
        'Sog+Noun ,+Punc domates+Noun ,+Punc patates+Noun ,+Punc elma+Noun ve+Conj ... (omitted for brevity)',
    ]
    index(lines, 'elma')
    index(lines, 'Sog|elma')

Output:

lines[1] match:('elma+Noun ', 'elma')
Sog+Noun ,+Punc domates+Noun ,+Punc patates+Noun ,+Punc elma+Noun ve+Conj ... (omitted for brevity)
lines[1] match:('Sog+Noun ', 'Sog')
Sog+Noun ,+Punc domates+Noun ,+Punc patates+Noun ,+Punc elma+Noun ve+Conj ... (omitted for brevity)

Tested with Python: 3.5
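
One caveat with this pattern: the leading .*? can consume any characters, so nothing forces the keyword to start at a token boundary (for instance, "xSog+Noun" would still match Sog). Below is a minimal sketch of one way to anchor it, assuming tokens are separated by whitespace and the stem is always followed by +. The function name find_stems is hypothetical, not part of the code above.

```python
import re

def find_stems(line, keywords):
    """Return the keywords that occur as a complete stem (the text before the first '+')."""
    # (?<!\S) asserts that the keyword is not preceded by a non-space character,
    # i.e. it starts at the beginning of the line or right after whitespace.
    alternation = "|".join(re.escape(k) for k in keywords)
    pattern = re.compile(r"(?<!\S)(" + alternation + r")\+")
    return [m.group(1) for m in pattern.finditer(line)]

print(find_stems("Sogan+Noun ,+Punc elma+Noun", ["Sog", "elma"]))
print(find_stems("Sog+Noun ,+Punc xelma+Noun", ["Sog", "elma"]))
```

Passing a list and escaping each keyword with re.escape avoids having to hand-write the | alternation and keeps keywords containing regex metacharacters literal.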

Upvotes: 1

Mad Physicist

Reputation: 114320

You will probably want to use the word boundary marker \b. This is an empty match for transitions between \w and \W. If you want your keywords to be literal strings, you will have to escape them first. You can combine everything into one regular expression using |:

pattern = re.compile(r'\b(' + '|'.join(map(re.escape, keyword)) + r')\b')

OR

pattern = re.compile(r'\b(?:' + '|'.join(re.escape(k) for k in keyword) + r')\b')

Computing the matches is a bit easier now, since you can use finditer instead of making your own comprehension:

matches = pattern.finditer(line)

Since each match is enclosed in a group, printing is not much more difficult:

result = "{:<15} {}".format(','.join(m.group() for m in matches), lineno)

OR

result = "{:<15} {}".format(','.join(map(re.Match.group, matches)), lineno)

Of course, don't forget to

import re
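
Putting the pieces together, here is a minimal sketch of how the function might look. It assumes the input is any iterable of lines (an open file works too) rather than a file path, and it yields results instead of printing them; both are choices made for this illustration. Note that finditer returns an iterator, which is always truthy, so the matches are collected into a list before the emptiness test. Also be aware that \b treats + as a boundary, so tags such as Noun count as whole words too.

```python
import re

def index(lines, keyword):
    """Yield (matched keywords, line number, line) for lines containing a whole-word keyword."""
    pattern = re.compile(r'\b(?:' + '|'.join(map(re.escape, keyword)) + r')\b')
    for lineno, line in enumerate(lines, start=1):
        # Collect the matched strings so that an empty result is falsy.
        matches = [m.group() for m in pattern.finditer(line)]
        if matches:
            yield ','.join(matches), lineno, line

sample = [
    'Sogan+Noun ,+Punc elma+Noun ve+Conj',
    'Sog+Noun ,+Punc domates+Noun',
]
for found, lineno, line in index(sample, ['Sog', 'elma']):
    print("{:<15} {}".format(found, lineno))
    print(line)
```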

Corner Case

If you have keywords where one is a prefix of another, it is safest to put the longer ones first. For example, if you have

keyword = ['foo', 'foobar']

The regex will be

\b(foo|foobar)\b

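When you encounter a line with foobar in it, foo is tried first and matches, but the trailing \b then fails. The engine backtracks and tries foobar, so with both \b anchors in place the longer keyword is still found. Order becomes significant when the trailing \b is dropped: | tries alternatives left to right and stops at the first one that succeeds, which is documented behavior of |. Pre-sorting your keywords by decreasing length before constructing the expression guards against that case: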

keyword.sort(key=len, reverse=True)

Or, if non-list inputs are possible:

keyword = sorted(keyword, key=len, reverse=True)

If you don't like this order, you can always print them in some other order after you match.
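
A small experiment, assuming the sample string 'foobar', shows how the order of alternatives interacts with \b and what the length-based sort produces:

```python
import re

text = 'foobar'

# With both \b anchors, the engine backtracks from 'foo' to 'foobar'.
anchored = re.search(r'\b(foo|foobar)\b', text).group()

# Without the trailing \b, the first alternative that succeeds wins.
shorter_first = re.search(r'\b(foo|foobar)', text).group()
longer_first = re.search(r'\b(foobar|foo)', text).group()

# Sorting by decreasing length puts the longer keyword first.
ordered = sorted(['foo', 'foobar'], key=len, reverse=True)

print(anchored, shorter_first, longer_first, ordered)
```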

Upvotes: 1

Dani Mesejo

Reputation: 61910

You could use the following regex:

import re

lines = [
    'Sogan+Noun ,+Punc domates+Noun ,+Punc patates+Noun ,+Punc elma+Noun ve+Conj turunçgil+Noun+A3pl ihracat+Noun+P3sg+Dat devlet+Noun destek+Noun+P3sg ver+Verb+Pass+Prog2+Cop .+Punc',
    'Sog+Noun ,+Punc domates+Noun ,+Punc patates+Noun ,+Punc elma+Noun ve+Conj turunçgil+Noun+A3pl ihracat+Noun+P3sg+Dat devlet+Noun destek+Noun+P3sg ver+Verb+Pass+Prog2+Cop .+Punc',
]

keywords = ['Sog']
pattern = re.compile(r'(\w+)\+')

for lineno, line in enumerate(lines):
    words = set(m.group(1) for m in pattern.finditer(line))  # convert to set for efficiency
    matches = [keyword for keyword in keywords if keyword in words]
    if matches:
        result = "{:<15} {}".format(','.join(matches), lineno)
        print(result)
        print(line)

Output

Sog             1
Sog+Noun ,+Punc domates+Noun ,+Punc patates+Noun ,+Punc elma+Noun ve+Conj turunçgil+Noun+A3pl ihracat+Noun+P3sg+Dat devlet+Noun destek+Noun+P3sg ver+Verb+Pass+Prog2+Cop .+Punc

Explanation

The pattern '(\w+)\+' matches any run of word characters (letters, digits, or underscores) followed by a + character. Because + is a special character in a regex, you need to escape it to match it literally. Then use group(1) to extract the captured group (i.e. the run of word characters).
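
Note that this pattern also captures tags that are themselves followed by + (for example Noun in ihracat+Noun+P3sg). If only the stem of each token should count, a regex-free alternative is to split on whitespace and keep the part before the first +. This is a minimal sketch under that assumption; the helper name stems is hypothetical.

```python
def stems(line):
    """Return the set of stems: the part of each whitespace-separated token before the first '+'."""
    return {token.split('+', 1)[0] for token in line.split()}

def index(lines, keywords):
    """Yield (matched keywords, 0-based line number) for lines whose stems include a keyword."""
    for lineno, line in enumerate(lines):
        words = stems(line)
        matches = [k for k in keywords if k in words]
        if matches:
            yield ','.join(matches), lineno

sample = [
    'Sogan+Noun ,+Punc elma+Noun ihracat+Noun+P3sg+Dat',
    'Sog+Noun ,+Punc elma+Noun',
]
for found, lineno in index(sample, ['Sog']):
    print("{:<15} {}".format(found, lineno))
```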

Further

  1. Regular expression syntax

Upvotes: 1
