Oliver

Reputation: 2212

Regex matching newline before word in python

I have a pattern: "\nvariable WORD"

This pattern occurs many times in the string, and I want a list of the indexes at which it occurs. "WORD" is fixed and doesn't change from instance to instance, but "variable" varies in content and length.

In Python, I know this matches every WORD and returns the indices in a list:

import re

contents="some long string"
print([m.start() for m in re.finditer('WORD',contents)])

So in short, how do I find the indices of every "variable" that comes after \n but before "WORD"?

Upvotes: 1

Views: 1787

Answers (4)

Oliver

Reputation: 2212

Ah, well, it turned out that the text actually contained carriage return characters (Ctrl-M, \r) instead of newline characters, which drove me crazy. I removed those and just used:

[m.start() for m in re.finditer(r'\w+\sWORD',contents)]

Thanks for all the help! SimpleParse also works, of course.
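
For anyone hitting the same carriage return problem, here is a minimal sketch of one way to handle it: normalize the line endings first, then capture the variable part in a group and ask for the group's start index. The sample text and the names `normalized`, `pattern` and `indices` are made up for illustration, and the indices refer to the normalized string, not the original one.

```python
import re

# Hypothetical sample text with Windows-style line endings (\r\n).
contents = "header\r\nalpha WORD\r\nbeta other\r\ngamma WORD"

# Turn every \r\n or lone \r into a plain \n so the pattern can rely on \n.
normalized = contents.replace("\r\n", "\n").replace("\r", "\n")

# Group 1 captures the variable part; [ \t]+ keeps the match on one line,
# whereas \s+ would also be allowed to cross a line break.
pattern = re.compile(r"\n(\w+)[ \t]+WORD")

# start(1) is the index of the captured text, not of the leading \n.
indices = [m.start(1) for m in pattern.finditer(normalized)]
print(indices)  # [7, 29]
```

Using m.start() instead of m.start(1) would give the position of the newline itself, one character earlier.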

Upvotes: 0

Paulo Scardine

Reputation: 77291

If the only tool you know is a hammer, every problem looks like a nail.

Regular expressions are powerful hammers, but sometimes not the best tool for the task at hand. In fact, regular expressions are abused a lot; I feel shivers down my spine every time someone asks me to check a complex regular expression written by another programmer (often I'm unable to understand my own after a few weeks).

On the other hand, EBNF (Extended Backus–Naur Form) notation is a lot easier to understand and maintain.

from simpleparse.parser import Parser

grammar = r"""
<space>      := [ \t]
<newline>    := '\n'
<identifier> := [A-Za-z_],[A-Za-z0-9_]*
match        := newline,identifier,space+,'WORD'
<junk>       := newline*,identifier,space+,-'WORD',(identifier/space)*
data         := (match/junk)*
"""

parser = Parser(grammar, 'data')

data = 'some junk\nvariable1 WORD\nvariable2 some ' +\
       'junk\nvariable3 WORD\nvariable4 some other ' +\
       'junk\nvariable5 WORD'

(start, matches, stop) = parser.parse(data)

print [ start for name, start, stop, other in matches ]

This will print:

[9, 44, 85]
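
For comparison only, the same offsets can be reproduced on the same data with the standard library. This is a rough regex counterpart of the `match` rule above, not part of SimpleParse; the names `pattern` and `offsets` are illustrative. Note that these offsets point at the newline character, so add 1 to get the start of the identifier.

```python
import re

data = ('some junk\nvariable1 WORD\nvariable2 some '
        'junk\nvariable3 WORD\nvariable4 some other '
        'junk\nvariable5 WORD')

# newline, identifier, one or more spaces/tabs, then the literal WORD
pattern = re.compile(r'\n[A-Za-z_][A-Za-z0-9_]*[ \t]+WORD')

# start() is the index of the '\n' that opens each match
offsets = [m.start() for m in pattern.finditer(data)]
print(offsets)  # [9, 44, 85]
```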

Upvotes: 3

Unpaid Oracles

Reputation: 2247

Would this suffice?

>>> import re
>>> s = '\nvariable1 WORD\nvariable2 WORD\nvariable3 WORD\nvariable4 WORD\nvariable5 WORD'
>>> re.findall(r'\n(\w+)\s+WORD', s)
['variable1', 'variable2', 'variable3', 'variable4', 'variable5']

What do you need the indexes for?

Upvotes: 2

tharen

Reputation: 1300

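You may need to offset the indices from the match start points, depending on your objective: result.start() points at the '\n', while result.start(1) points at the captured text. A literal '\n' in the pattern matches newlines as is; the MULTILINE flag is only needed if you anchor with ^ or $ instead.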

import re

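mytext = '\nvar1 WORD\nvar2 WORD\nvar3 WORD'
# compile a pattern to find the 'var*' after \n (raw string keeps the escapes intact)
pat = re.compile(r'\n(.*?)\s+WORD')

results = re.finditer(pat, mytext)

for result in results:
    # start() is the index of the '\n'; use start(1) for the index of 'var*'
    print(result.start())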
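
As a sketch of where the MULTILINE flag does come into play: if the pattern is anchored with ^ rather than a literal '\n', the flag makes ^ match at the start of every line. The sample text and the names `text`, `pattern` and `hits` are made up for illustration. Be aware that this variant also matches on the very first line, which has no preceding newline, so it is not strictly the same as the "\nvariable WORD" pattern in the question.

```python
import re

# Hypothetical sample: the first line has no newline in front of it.
text = "var0 WORD\nvar1 WORD\nvar2 other\nvar3 WORD"

# With re.MULTILINE, ^ matches at the start of the string and after each \n.
pattern = re.compile(r"^(\S+)[ \t]+WORD", re.MULTILINE)

# Collect (index of the variable part, variable text) for each match.
hits = [(m.start(1), m.group(1)) for m in pattern.finditer(text)]
print(hits)  # [(0, 'var0'), (10, 'var1'), (31, 'var3')]
```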

Upvotes: 0
