Reputation: 754
Test string:
First
Name:
MICKEY
One to
four lines
of cruft go here
Last
Name:
MOUSE
More cruft
goes here
I want to return a single group "MICKEY MOUSE"
I have:
(?:First\WName:)\W((.+)\W(?:((.+\W){1,4})(?:Last\WName:\W))(.+))
Group 2 returns MICKEY and group 5 returns MOUSE.
I thought that enclosing them in a single group and making the middle cruft and Last Name segments non-capturing groups with ?: would prevent them from appearing. But group 1 returns:
MICKEY
One to
four lines
of cruft go here
Last
Name:
MOUSE
How can I get it to remove the middle stuff from what's returned (or, alternatively, combine group 2 and group 5 into a single named or numbered group)?
Upvotes: 1
Views: 19749
Reputation: 26
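A capture group always spans one contiguous piece of the input, so nothing matched inside group 1 can be dropped from it. Non-capturing groups, declared with (?:...), only stop the inner pieces from getting group numbers of their own. So make the cruft groups non-capturing and combine the two name groups afterwards.
After modifying the regex to:
(?:First\WName:)\W((.+)\W(?:(?:(?:.+\W){1,4})(?:Last\WName:\W))(.+))
the first name is group 2 and the last name is group 3, and you can do the following in Python: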
import re
inp = """
First
Name:
MICKEY
One to
four lines
of cruft go here
Last
Name:
MOUSE
More cruft
goes here
"""
query = r'(?:First\WName:)\W((.+)\W(?:(?:(?:.+\W){1,4})(?:Last\WName:\W))(.+))'
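# Use re.search: inp starts with a newline, and re.match only matches at position 0
match = re.search(query, inp)
# Group 1 is the whole span including the cruft; groups 2 and 3 are the two names
output = ' '.join(match.group(2, 3))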
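Since one group cannot skip over text in the middle, the combining has to happen after the match. Here is a minimal sketch of that idea with the outer group dropped and the two names captured as named groups. The helper name `full_name` and the group names `first` and `last` are made up for this illustration and are not part of the original pattern.

```python
import re

# Hypothetical pattern for illustration; the group names "first" and "last"
# are assumptions, not taken from the original regex.
NAME_RE = re.compile(
    r'First\WName:\W'      # "First" and "Name:" labels on consecutive lines
    r'(?P<first>.+)\W'     # the first-name line
    r'(?:.+\W){1,4}'       # one to four cruft lines, not captured
    r'Last\WName:\W'       # "Last" and "Name:" labels
    r'(?P<last>.+)'        # the last-name line
)


def full_name(text):
    """Return 'FIRST LAST', or None when the pattern is not found."""
    m = NAME_RE.search(text)  # search, not match: the text may start with a newline
    if m is None:
        return None
    return '{} {}'.format(m.group('first'), m.group('last'))


sample = (
    "\nFirst\nName:\nMICKEY\nOne to\nfour lines\nof cruft go here\n"
    "Last\nName:\nMOUSE\nMore cruft\ngoes here\n"
)
print(full_name(sample))  # MICKEY MOUSE
```

Returning None on a failed search avoids the AttributeError you would otherwise get from calling a method on a missing match.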
Upvotes: 1
Reputation: 92854
With the re.search() function and a specific regex pattern:
import re
s = '''
First
Name:
MICKEY
One to
four lines
of cruft go here
Last
Name:
MOUSE
More cruft
goes here'''
result = re.search(r'Name:\n(?P<firstname>\S+)[\s\S]*Name:\n(?P<lastname>\S+)', s).groupdict()
print(result)
The output:
{'firstname': 'MICKEY', 'lastname': 'MOUSE'}
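To get the single string the question asks for, the dictionary can be fed to a format call. A small sketch, reusing the same pattern on an inline copy of the test string; the variable name `full` is only illustrative.

```python
import re

s = (
    "\nFirst\nName:\nMICKEY\nOne to\nfour lines\nof cruft go here\n"
    "Last\nName:\nMOUSE\nMore cruft\ngoes here"
)

pattern = r'Name:\n(?P<firstname>\S+)[\s\S]*Name:\n(?P<lastname>\S+)'
m = re.search(pattern, s)

# groupdict() maps each named group to the text it captured;
# fall back to None if the pattern did not match at all
full = '{firstname} {lastname}'.format(**m.groupdict()) if m else None
print(full)  # MICKEY MOUSE
```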
----------
Or even simpler with the re.findall() function:
result = re.findall(r'(?<=Name:\n)(\S+)', s)
print(result)
The output:
['MICKEY', 'MOUSE']
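One thing to keep in mind: the lookbehind fires after every "Name:" line, whatever label precedes it. If the text might contain other such labels, a variant that spells out First and Last is a bit stricter. This is a sketch under that assumption; the "Middle" label in the sample is invented purely to show the difference.

```python
import re

# Invented sample with an extra "Middle / Name:" block
text = "First\nName:\nMICKEY\nMiddle\nName:\nX\nLast\nName:\nMOUSE\n"

# Original approach: every token right after a "Name:" line
loose = re.findall(r'(?<=Name:\n)(\S+)', text)

# Stricter variant: only tokens after "First\nName:" or "Last\nName:"
strict = re.findall(r'(?:First|Last)\nName:\n(\S+)', text)

print(loose)             # ['MICKEY', 'X', 'MOUSE']
print(strict)            # ['MICKEY', 'MOUSE']
print(' '.join(strict))  # MICKEY MOUSE
```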
Upvotes: 1
Reputation: 71461
You can split the string and check if all characters are uppercase:
import re
s = """
First
Name:
MICKEY
One to
four lines
of cruft go here
Last
Name:
MOUSE
More cruft
goes here
"""
final_data = ' '.join(i for i in s.split('\n') if re.findall('^[A-Z]+$', i))
Output:
'MICKEY MOUSE'
Or, a pure regex solution:
new_data = ' '.join(re.findall(r'(?<=\n)[A-Z]+(?=\n)', s))
Output:
'MICKEY MOUSE'
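The same whole-line check can also be written with the re.MULTILINE flag, which makes ^ and $ anchor at every line rather than only at the ends of the string. A sketch follows. Note that any line made up only of capital letters gets picked up, so this relies on the names being the only all-uppercase lines; the extra "OK" line in the second sample is invented to show that.

```python
import re

s = (
    "\nFirst\nName:\nMICKEY\nOne to\nfour lines\nof cruft go here\n"
    "Last\nName:\nMOUSE\nMore cruft\ngoes here\n"
)

# With MULTILINE, ^ and $ match at the start and end of each line
names = re.findall(r'^[A-Z]+$', s, flags=re.MULTILINE)
print(' '.join(names))  # MICKEY MOUSE

# Caveat: an unrelated all-caps line is matched as well
noisy = s + "OK\n"
noisy_names = re.findall(r'^[A-Z]+$', noisy, flags=re.MULTILINE)
print(noisy_names)  # ['MICKEY', 'MOUSE', 'OK']
```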
Upvotes: 0