Reputation: 2212
I have a pattern: "\nvariable WORD"
This pattern occurs many times in the string, and I want a list of the indexes where it occurs. "WORD" is fixed and doesn't change from instance to instance, but "variable" varies in content and length.
In Python, I know this matches every WORD and returns their indices in a list:
import re
contents = "some long string"
print([m.start() for m in re.finditer('WORD', contents)])
So in short, how do I find indices of all "variable" after \n but before "WORD"?
Upvotes: 1
Views: 1787
Reputation: 2212
Ah, well, it turned out that the text actually contained carriage return (Ctrl-M) characters instead of newline characters, which drove me crazy. I removed those and then just used:
[m.start() for m in re.finditer(r'\w+\sWORD', contents)]
Thanks for all the help! SimpleParse also works, of course.
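For anyone hitting the same carriage-return problem, here is a minimal sketch of one way to handle it in code rather than cleaning the file by hand. It assumes the text uses '\r\n' or bare '\r' line endings; the helper name variable_indices and the sample string are made up for illustration. Note that the returned indices refer to the normalized text, not the original.

```python
import re

def variable_indices(contents):
    """Return the start index of each word that directly precedes WORD."""
    # Normalize Windows (\r\n) and bare carriage return (\r) endings to \n.
    normalized = contents.replace('\r\n', '\n').replace('\r', '\n')
    # start(1) is the index of the captured word, not of the preceding newline.
    pattern = re.compile(r'\n(\w+)\s+WORD')
    return [m.start(1) for m in pattern.finditer(normalized)]

# Hypothetical sample with Windows-style line endings.
sample = 'some junk\r\nvariable1 WORD\r\nvariable2 some junk\r\nvariable3 WORD'
indices = variable_indices(sample)
print(indices)
```

Using start(1) instead of start() points at the variable itself, so no manual offset for the newline is needed.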
Upvotes: 0
Reputation: 77291
If the only tool you know is a hammer, every problem looks like a nail.
Regular expressions are powerful hammers, but sometimes not the best tool for the task at hand. In fact, regular expressions are abused a lot; I feel shivers down my spine every time someone asks me to check a complex regular expression from another programmer (often I'm unable to understand my own after a few weeks).
On the other hand, EBNF (Extended Backus–Naur Form) notation is a lot easier to understand and maintain.
from simpleparse.parser import Parser
grammar = r"""
<space> := [ \t]
<newline> := '\n'
<identifier> := [A-Za-z_],[A-Za-z0-9z_]*
match := newline,identifier,space+,'WORD'
<junk> := newline*,identifier,space+,-'WORD',(identifier/space)*
data := (match/junk)*
"""
parser = Parser(grammar, 'data')
data = 'some junk\nvariable1 WORD\nvariable2 some ' +\
'junk\nvariable3 WORD\nvariable4 some other ' +\
'junk\nvariable5 WORD'
(start, matches, stop) = parser.parse(data)
print([start for name, start, stop, other in matches])
This will print:
[9, 44, 85]
Upvotes: 3
Reputation: 2247
Would this suffice?
>>> import re
>>> s = '\nvariable1 WORD\nvariable2 WORD\nvariable3 WORD\nvariable4 WORD\nvariable5 WORD'
>>> re.findall(r'\n(\w+)\s+WORD', s)
['variable1', 'variable2', 'variable3', 'variable4', 'variable5']
What do you need the indexes for?
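If the positions are needed as well as the names, the same pattern can be used with re.finditer instead of re.findall. This is only a sketch built on the sample string above; the name pairs is made up for illustration.

```python
import re

s = '\nvariable1 WORD\nvariable2 WORD\nvariable3 WORD\nvariable4 WORD\nvariable5 WORD'

# Each match object exposes both the captured text and where it starts.
# group(1) is the variable name; start(1) is its index in s.
pairs = [(m.group(1), m.start(1)) for m in re.finditer(r'\n(\w+)\s+WORD', s)]

for name, index in pairs:
    print(name, index)
```

Each index can be checked by slicing: s[index:index + len(name)] gives back the name.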
Upvotes: 2
Reputation: 1300
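Depending on your objective, you may need to offset the indices from the match start, since each match begins at the '\n' rather than at the variable. A literal '\n' in the pattern matches newlines without any flag; the MULTILINE flag is only needed if you anchor the pattern with ^ or $ instead.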
import re
mytext = '\nvar1 WORD\nvar2 WORD\nvar3 WORD'
# Compile a pattern to find the 'var*' after \n
pat = re.compile(r'\n(.*?)\s+WORD')
for result in pat.finditer(mytext):
    # start() is the index of the '\n'; start(1) is the index of the variable
    print(result.start())
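As a rough illustration of where the MULTILINE flag comes into play: if the pattern is anchored with ^ rather than a literal '\n', the flag makes ^ match at the start of every line. A side effect is that a variable on the very first line is found too. The sample text and the name starts below are made up for the example.

```python
import re

# Hypothetical sample: the first line has no preceding newline.
mytext = 'var0 WORD\nvar1 WORD\nvar2 other\nvar3 WORD'

# With re.MULTILINE, ^ matches at the start of each line, not only
# at the start of the whole string.
pat = re.compile(r'^(\S+)[ \t]+WORD', re.MULTILINE)

# start(1) is the index of the variable itself, so no offset is needed.
starts = [m.start(1) for m in pat.finditer(mytext)]
print(starts)
```

Using [ \t]+ rather than \s+ between the variable and WORD keeps a match from spilling across a line break.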
Upvotes: 0