Reputation: 27

How can one use regex to capture the text occurring between lines beginning with a single semicolon?

I want to capture the text between lines which begin with single semicolons:

Sample input:

s = '''
;

the color blue

;

the color green

;

the color red

;
'''

This is the desired output:

['the color blue', 'the color green', 'the color red']

This attempt doesn't work, because the greedy .* with re.S runs from the first semicolon line to the last one:

import re
pat = r'^;(.*)^;'
r = re.findall(pat, s, re.S|re.M)
print(r)

This is the wrong output:

['\n\nthe color blue\n\n;\n\nthe color green\n\n;\n\nthe color red\n\n']

Upvotes: 1

Answers (5)

Aran-Fey

Reputation: 43196

You can use ;\s*(.*?)\s*(?=;). Usage:

print( re.findall(r'(?s);\s*(.*?)\s*(?=;)', s) )
# output: ['the color blue', 'the color green', 'the color red']

Explanation:

(?s)   # dot-all modifier (. matches newlines)
;      # consume a semicolon
\s*    # skip whitespace
(.*?)  # capture the following text, as little as possible, such that...
\s*    # ... it is followed only by (optional) whitespace, and...
(?=;)  # ... a semicolon
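The same explanation can live inside the pattern itself. This is a minimal sketch, assuming the sample string s from the question; the names pattern and result are only illustrative. It uses re.VERBOSE so whitespace and comments in the pattern are ignored, and re.DOTALL in place of the inline (?s).

```python
import re

s = '''
;

the color blue

;

the color green

;

the color red

;
'''

# Same pattern as above, written with re.VERBOSE so the comments live inside it.
pattern = re.compile(r'''
    ;        # consume a semicolon
    \s*      # skip whitespace
    (.*?)    # capture as little text as possible
    \s*      # trailing whitespace stays out of the capture
    (?=;)    # stop just before the next semicolon, without consuming it
''', re.VERBOSE | re.DOTALL)

result = pattern.findall(s)
print(result)  # ['the color blue', 'the color green', 'the color red']
```

Because the closing semicolon is only looked at, not consumed, it is still available as the opening semicolon of the next match.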

Upvotes: 0

Bill Bell

Reputation: 21643

You didn't ask for this, I know. But it's worth considering pyparsing as an alternative to re. Indeed, pyparsing properly contains regex. Notice how this simple parser copes with various numbers of empty lines.

>>> parsifal = open('temp.txt').read()
>>> print (parsifal)


;

the colour blue
;
the colour green
;
the colour red
;
the colour purple




;

the colour magenta

;


>>> import pyparsing as pp
>>> p = pp.OneOrMore(pp.Suppress(';\n')+pp.ZeroOrMore(pp.Suppress('\n'))+pp.CharsNotIn(';\n')+pp.ZeroOrMore(pp.Suppress('\n')))
>>> p.parseString(parsifal)
(['the colour blue', 'the colour green', 'the colour red', 'the colour purple', 'the colour magenta'], {})

As a whole, the parser matches one or more groups, each made of a semicolon line, any number of blank lines, a run of characters other than semicolons and new-lines, and any trailing new-lines.

Upvotes: 0

user557597

Reputation:

Treat the semicolon lines as delimiters.

(?sm)^;\s*\r?\n(.*?)\s*(?=^;\s*\r?\n)

https://regex101.com/r/4tKX0F/1

Explained

 (?sm)                         # Modifiers: dot-all, multi-line
 ^ ; \s* \r? \n                # Beginning delimiter
 ( .*? )                       # (1), Text 
 \s*                           # Wsp trim
 (?= ^ ; \s* \r? \n )          # End delimiter
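For use from Python, a minimal sketch of the compact pattern might look like the following. The helper name between_semicolon_lines and the sample string are made up for illustration. The sample puts a semicolon in the middle of a line to show that only semicolons at the start of a line act as delimiters.

```python
import re

# Delimiter-style pattern from this answer: a semicolon only counts when it
# starts a line (multi-line mode), and dot-all lets . cross newlines.
PATTERN = re.compile(r'(?sm)^;\s*\r?\n(.*?)\s*(?=^;\s*\r?\n)')


def between_semicolon_lines(text):
    """Hypothetical helper: return the text between lines that start with ';'."""
    return PATTERN.findall(text)


sample = ';\nthe color blue; or teal\n;\nthe color green\n;\n'
print(between_semicolon_lines(sample))
# ['the color blue; or teal', 'the color green']
```

The mid-line semicolon stays inside the captured text because ^ cannot match there.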

Upvotes: 1

s3bw

Reputation: 3049

You can use this pattern:

pat = r';\n\n([\w ]+)'

r = re.findall(pat, s)

It captures each run of word characters and spaces that follows a semicolon and one blank line, which is enough for the sample input.

Upvotes: 0

PYA

Reputation: 8636

Non-regex solution: split on ; and remove the empty strings.

s = '''
    ;

    the color blue


;

the color green

;

the color red

;
'''

f = s.split(';')

# strip() removes the newlines and any stray indentation around each piece
x = [a.strip() for a in f]

print(x)  # prints ['', 'the color blue', 'the color green', 'the color red', '']

a = [elem for elem in x if elem]

print(a)  # prints ['the color blue', 'the color green', 'the color red']
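The split-and-filter steps can also be folded into one small function. This is only a sketch; the name split_on_semicolons is hypothetical. It trims each piece with strip(), so indented input like the sample here is handled too. Note that it splits on every semicolon, not only those at the start of a line.

```python
def split_on_semicolons(text):
    """Hypothetical helper: split on ';' and keep the non-empty, trimmed pieces."""
    pieces = (piece.strip() for piece in text.split(';'))
    return [piece for piece in pieces if piece]


s = '''
    ;

    the color blue


;

the color green

;

the color red

;
'''

print(split_on_semicolons(s))
# ['the color blue', 'the color green', 'the color red']
```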

Upvotes: 1
