Reputation: 27
I want to capture the text between lines that begin with a single semicolon.
Sample input:
s = '''
;

the color blue

;

the color green

;

the color red

;
'''
This is the desired output:
['the color blue', 'the color green', 'the color red']
This attempted solution doesn't work:
import re
pat = r'^;(.*)^;'
r = re.findall(pat, s, re.S|re.M)
print(r)
It produces this wrong output, a single match that runs from the first semicolon to the last:
['\n\nthe color blue\n\n;\n\nthe color green\n\n;\n\nthe color red\n\n']
Upvotes: 1
Views: 70
Reputation: 43196
You can use ;\s*(.*?)\s*(?=;). Usage:
print( re.findall(r'(?s);\s*(.*?)\s*(?=;)', s) )
# output: ['the color blue', 'the color green', 'the color red']
Explanation:
(?s) # dot-all modifier (. matches newlines)
; # consume a semicolon
\s* # skip whitespace
(.*?) # capture the following text, as little as possible, such that...
\s* # ... it is followed only by (optional) whitespace, and...
(?=;) # ... a semicolon
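To see why the lookahead matters, here is a minimal sketch that compares it with a variant that consumes the closing semicolon. It assumes the sample string has a blank line on either side of each text line, as the \n\n runs in the question's output suggest; the names consumed and lookahead are only illustrative.

```python
import re

# Assumed sample: a blank line surrounds each line of text.
s = "\n;\n\nthe color blue\n\n;\n\nthe color green\n\n;\n\nthe color red\n\n;\n"

# Consuming the closing semicolon means it cannot also open the next match,
# so every second block is skipped.
consumed = re.findall(r";\s*(.*?)\s*;", s, re.S)

# The lookahead only checks for the closing semicolon, leaving it available
# as the opening delimiter of the next match.
lookahead = re.findall(r";\s*(.*?)\s*(?=;)", s, re.S)

print(consumed)   # ['the color blue', 'the color red']
print(lookahead)  # ['the color blue', 'the color green', 'the color red']
```

Note that this pattern treats every semicolon as a delimiter, so a semicolon inside the text would also end a block.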
Upvotes: 0
Reputation: 21643
You didn't ask for this, I know, but it's worth considering pyparsing as an alternative to re. Indeed, pyparsing includes regular expressions among its building blocks. Notice how this simple parser copes with varying numbers of empty lines.
>>> parsifal = open('temp.txt').read()
>>> print (parsifal)
;
the colour blue
;
the colour green
;
the colour red
;
the colour purple
;
the colour magenta
;
>>> import pyparsing as pp
>>> p = pp.OneOrMore(pp.Suppress(';\n')+pp.ZeroOrMore(pp.Suppress('\n'))+pp.CharsNotIn(';\n')+pp.ZeroOrMore(pp.Suppress('\n')))
>>> p.parseString(parsifal)
(['the colour blue', 'the colour green', 'the colour red', 'the colour purple', 'the colour magenta'], {})
As a whole, the parser matches one or more (OneOrMore) repetitions of this sequence: a semicolon and newline, any further newlines, a run of characters other than semicolons and newlines (the text that is kept), and any trailing newlines.
Upvotes: 0
Reputation:
Treat it like delimiters.
(?sm)^;\s*\r?\n(.*?)\s*(?=^;\s*\r?\n)
https://regex101.com/r/4tKX0F/1
Explained
(?sm) # Modifiers: dot-all, multi-line
^ ; \s* \r? \n # Beginning delimiter
( .*? ) # (1), Text
\s* # Wsp trim
(?= ^ ; \s* \r? \n ) # End delimiter
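The spaced-out form above can be used directly in Python with the verbose flag. The following is a sketch under the assumption that the sample has a blank line around each text line; DELIMITED and blocks are illustrative names, not part of the original answer.

```python
import re

# Same pattern as above, written in verbose mode so the comments can stay inline.
DELIMITED = re.compile(
    r"""
    ^ ; \s* \r? \n          # beginning delimiter: a line starting with ';'
    ( .*? )                 # (1) the text, as little as possible
    \s*                     # trim trailing whitespace
    (?= ^ ; \s* \r? \n )    # end delimiter, checked but not consumed
    """,
    re.S | re.M | re.X,
)

# Assumed sample: a blank line surrounds each line of text.
s = "\n;\n\nthe color blue\n\n;\n\nthe color green\n\n;\n\nthe color red\n\n;\n"

blocks = DELIMITED.findall(s)
print(blocks)  # ['the color blue', 'the color green', 'the color red']
```

Because the delimiter is anchored to the start of a line, a semicolon in the middle of a text line stays part of the captured text.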
Upvotes: 1
Reputation: 3049
You can have this as the pattern:
pat = r';\n\n([\w* *]*)'
r = re.findall(pat, s)
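That should capture what you need, provided each semicolon is followed by exactly one blank line and the text contains only word characters, spaces, or asterisks.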
Upvotes: 0
Reputation: 8636
Non-regex solution: I split on ; and remove the empty strings.
s = '''
;
the color blue
;
the color green
;
the color red
;
'''
f = s.split(';')
x = [a.strip('\n') for a in f]
print(x)  # prints ['', 'the color blue', 'the color green', 'the color red', '']
a = [elem for elem in x if elem]
print(a)  # prints ['the color blue', 'the color green', 'the color red']
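The two steps can also be folded into one small helper. This is only a sketch: split_blocks and its delimiter parameter are hypothetical names, and it uses strip() with no argument so that spaces and \r are trimmed as well as newlines.

```python
def split_blocks(text, delimiter=";"):
    """Return the non-empty, whitespace-trimmed pieces between delimiters."""
    pieces = (piece.strip() for piece in text.split(delimiter))
    return [piece for piece in pieces if piece]


s = "\n;\nthe color blue\n;\nthe color green\n;\nthe color red\n;\n"
print(split_blocks(s))  # ['the color blue', 'the color green', 'the color red']
```

Like the original, this splits on every semicolon, so it only suits input where the text itself contains none.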
Upvotes: 1