Reputation: 3177
I am looking for a regex for the following case:
The string contains lines that consist of a single uppercase word, each preceded by two newlines (that is, by a blank line). Each such line is followed by several lines of alphanumeric text (possibly non-ASCII UTF-8), which may include empty lines. I want to capture each whole portion, starting at the uppercase-word line and ending just before the next uppercase-word line. The same uppercase word may appear more than once.
I searched extensively and tried several patterns, but none of them worked.
Example
ASDF
wqer rtre 34 $^&% fsfa
DDwrgd 43 er 1. ewrtfg
324rfegf 4gfgre
PIIPUU
gre tt HKH rre345
sdrfetre
ewrewrqwr werfewrt34vds
ret
gre
wretretertettre
PIIPUU
asdf reb dsfdsg
dsafdfbh rt3456 rge grefgreg
reretr erfret34 ef
retretretr
QWE
pritoy Fbhfg 45345 )*9
tret 345 gret54
retre 56 gre ger
retgrh 546ttre
MMNNBMB
aserew Sfjlkjf
gdf
rerettyrdfv re HFGHFFHF er
ergre ret retre
ret retretret
reg regrtgh rertgre tret
I want to split the text into the portions that match this condition, like below:
ASDF
wqer rtre 34 $^&% fsfa
DDwrgd 43 er 1. ewrtfg
324rfegf 4gfgre
PIIPUU
gre tt HKH rre345
sdrfetre
ewrewrqwr werfewrt34vds
ret
gre
wretretertettre
PIIPUU
asdf reb dsfdsg
dsafdfbh rt3456 rge grefgreg
reretr erfret34 ef
retretretr
QWE
pritoy Fbhfg 45345 )*9
tret 345 gret54
retre 56 gre ger
retgrh 546ttre
MMNNBMB
aserew Sfjlkjf
gdf
rerettyrdfv re HFGHFFHF er
ergre ret retre
ret retretret
reg regrtgh rertgre tret
Upvotes: 0
Views: 104
Reputation: 27723
This expression should extract the desired sections, although re.findall also returns empty tuples alongside them that need to be filtered out:
(?=^[A-Z]+$)([\s\S]*?)(?=^[A-Z]+$)|([\s\S]*)
The expression is explained on the top right panel of this demo if you wish to explore/simplify/modify it.
import re
regex = r"(?=^[A-Z]+$)([\s\S]*?)(?=^[A-Z]+$)|([\s\S]*)"
test_str = """\
ASDF
wqer rtre 34 $^&% fsfa
DDwrgd 43 er 1. ewrtfg
324rfegf 4gfgre

QWE
pritoy Fbhfg 45345 )*9
tret 345 gret54
retre 56 gre ger
retgrh 546ttre

PIIPUU
gre tt HKH rre345 
sdrfetre
ewrewrqwr werfewrt34vds

ret
gre
wretretertettre

MMNNBMB
aserew Sfjlkjf
gdf
rerettyrdfv re HFGHFFHF er
ergre ret retre 
ret retretret 

reg regrtgh rertgre tret"""
print(re.findall(regex, test_str, re.MULTILINE))
[('', ''), ('ASDF\nwqer rtre 34 $^&% fsfa\nDDwrgd 43 er 1. ewrtfg\n324rfegf 4gfgre\n\n', ''), ('', ''), ('QWE\npritoy Fbhfg 45345 )*9\ntret 345 gret54\nretre 56 gre ger\nretgrh 546ttre\n\n', ''), ('', ''), ('PIIPUU\ngre tt HKH rre345 \nsdrfetre\newrewrqwr werfewrt34vds\n\nret\ngre\nwretretertettre\n\n', ''), ('', ''), ('', 'MMNNBMB\naserew Sfjlkjf\ngdf\nrerettyrdfv re HFGHFFHF er\nergre ret retre \nret retretret \n\nreg regrtgh rertgre tret'), ('', '')]
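The output above mixes empty tuples with the real sections, because the pattern can also match the empty string at each heading. Here is a minimal sketch of one way to clean that up, assuming Python 3.7 or later and a shortened, made-up sample string (not the original data): iterate with re.finditer, take the whole match instead of the individual groups, and keep only the non-empty results.

```python
import re

# Same pattern as in the answer above.
regex = r"(?=^[A-Z]+$)([\s\S]*?)(?=^[A-Z]+$)|([\s\S]*)"

# Hypothetical, shortened sample; blank lines precede each uppercase heading.
sample = (
    "ASDF\nline one\nline two\n\n"
    "QWE\nline three\n\n"
    "PIIPUU\nline four\n\nline five"
)

# group(0) is the whole match regardless of which alternative matched;
# strip() drops the trailing blank line and lets us skip the empty matches.
sections = [
    m.group(0).strip()
    for m in re.finditer(regex, sample, re.MULTILINE)
    if m.group(0).strip()
]

for section in sections:
    print(repr(section))
```

Note that this pattern treats any line made up only of uppercase letters as a heading, whether or not a blank line precedes it.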
Upvotes: 4
Reputation: 336168
Try this:
import re
regex = re.compile(r"^[A-Z]+\r?\n(?:(?!^\r?\n[A-Z]+\r?\n).)*", re.MULTILINE|re.DOTALL)
Explanation:
^                          # Start of line
[A-Z]+                     # Match uppercase ASCII keyword
\r?\n                      # Match newline
(?:                        # Start of non-capturing group
  (?!^\r?\n[A-Z]+\r?\n)    # Make sure we're not (yet) at the blank line that precedes another keyword
  .                        # If we're not there yet, match any character including newline
)*                         # Repeat any number of times.
Test it live on regex101.com.
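As a rough illustration of how this pattern behaves, here is a small sketch run against a made-up sample (the sample text and the names keyword_block and blocks are assumptions for the example, not part of the answer). It includes a duplicate keyword and a blank line inside a body, the two cases the question calls out.

```python
import re

# Pattern from the answer above, compiled with the same flags.
keyword_block = re.compile(
    r"^[A-Z]+\r?\n(?:(?!^\r?\n[A-Z]+\r?\n).)*",
    re.MULTILINE | re.DOTALL,
)

# Hypothetical sample: "PIIPUU" appears twice, and the second section
# contains a blank line that is NOT followed by an uppercase keyword.
sample = (
    "ASDF\nfirst body\n\n"
    "PIIPUU\nsecond body\n\nstill second\n\n"
    "PIIPUU\nthird body\n"
)

# Each match ends with the newline of its last body line; rstrip() removes it.
blocks = [m.group(0).rstrip() for m in keyword_block.finditer(sample)]

for block in blocks:
    print(repr(block))
```

The blank line inside the second section does not end the match, because the negative lookahead only fails when the blank line is followed by an all-uppercase line.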
Upvotes: 3
Reputation: 521389
Here is one approach using re.findall:
matches = re.findall(r'(?:^|\n\n)([A-Z]{3,}.*?)(?=\n\n[A-Z]{3,}\n|$)', input, flags=re.DOTALL)
print(matches)
This prints:
['ASDF\nwqer rtre 34 $^&% fsfa\nDDwrgd 43 er 1. ewrtfg\n324rfegf 4gfgre',
'QWE\npritoy Fbhfg 45345 )*9\ntret 345 gret54\nretre 56 gre ger\nretgrh 546ttre',
'PIIPUU\ngre tt HKH rre345 \nsdrfetre\newrewrqwr werfewrt34vds\n\nret\ngre\nwretretertettre',
'MMNNBMB\naserew Sfjlkjf\ngdf\nrerettyrdfv re HFGHFFHF er\nergre ret retre \nret retretret \n\nreg regrtgh rertgre tret']
Here is an explanation of the regex pattern being used:
(?:^|\n\n) match either the start of the input or two consecutive newlines
([A-Z]{3,}.*?) then match and capture three or more capital letters,
followed by all content (including newlines) until seeing
(?=\n\n[A-Z]{3,}\n|$) either two newlines and a capital term or the end of the input
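To make the effect of the {3,} quantifier concrete, here is a short sketch on made-up input (the sample strings and variable names are assumptions for illustration only). It shows the pattern splitting on duplicate keywords, and shows that an uppercase line shorter than three letters is not treated as a heading.

```python
import re

# Pattern from the answer above; DOTALL lets "." cross newlines.
pattern = r'(?:^|\n\n)([A-Z]{3,}.*?)(?=\n\n[A-Z]{3,}\n|$)'

# Hypothetical sample with a repeated keyword and an inner blank line.
sample = (
    "ASDF\nfirst body\n\n"
    "PIIPUU\nsecond body\n\nstill second\n\n"
    "PIIPUU\nthird body"
)
matches = re.findall(pattern, sample, flags=re.DOTALL)

# A two-letter uppercase line ("AB") does not satisfy [A-Z]{3,},
# so it stays inside the preceding section instead of starting a new one.
short_sample = "ASDF\nbody\n\nAB\nmore"
short_matches = re.findall(pattern, short_sample, flags=re.DOTALL)

print(matches)
print(short_matches)
```

If shorter keywords should count as headings too, the quantifier would need to be relaxed (for example to +), which is a change to the answer's pattern rather than something it does as written.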
Upvotes: 3