Reputation: 7832
I am using this tutorial to learn regex in Python, and it looks excellent: http://regex101.com/r/vB7mV2
According to the tutorial, the code I should use is:
import re
p = re.compile(r'^(?P<Given>\w+) (?P<Middle>\w\.) (?P<Family>\w+)$', re.MULTILINE)
str = "Jack A. Smith\nMary B. Miller"
m = p.match(str)
print m.group(0)
Jack A. Smith
print m.group(1)
Jack
print m.group(2)
A.
print m.group(3)
Smith
print m.group(4)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: no such group
To my surprise, I lost little Mary B. Miller - there was no m.group(4)
So I have a few follow up questions:
(1) I am using multiline, why did it match only the first one, i.e., Jack A. Smith in the example?
(2) I am using Given, Middle and Family as the tag names for each match. How do I access the data using those tags and not just m.group(i)?
(3) Let us say I want to match and replace, i.e., match Mary B. Miller and replace it with Jane M. Goldstein, so that the resulting string is: str = "Jack A. Smith\nJane M. Goldstein". How would I go about that? (Kind of unrelated, let's call it a bonus Q.)
Upvotes: 1
Views: 93
Reputation: 103704
I think I would do something like this:
import re
txt='''\
Jack A. Smith
Mary B. Miller
Jordan Brewster
Kathy Beth Turner'''
>>> [m.groups() for m in re.finditer(r'^(\w+)\s+(\w\.|\w*)\s*(\b\w+\b)$', txt, re.M)]
[('Jack', 'A.', 'Smith'), ('Mary', 'B.', 'Miller'), ('Jordan', '', 'Brewster'), ('Kathy', 'Beth', 'Turner')]
Works like so:
^(\w+)\s+(\w\.|\w*)\s*(\b\w+\b)$
This allows you to capture names with an optional middle name or middle initial.
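To make the pieces of that pattern easier to read, here is a sketch of the same regex written with re.VERBOSE so each part can carry a comment. It uses Python 3 syntax, and the names NAME_RE and rows are mine, not part of the original answer.

```python
import re

# Same pattern as above, spread out with re.VERBOSE (whitespace and
# comments inside the pattern are ignored by the regex engine).
NAME_RE = re.compile(r"""
    ^(\w+)          # given name
    \s+
    (\w\.|\w*)      # middle initial with a dot, a middle name, or nothing
    \s*
    (\b\w+\b)       # family name
    $
""", re.MULTILINE | re.VERBOSE)

txt = "Jack A. Smith\nMary B. Miller\nJordan Brewster\nKathy Beth Turner"

rows = [m.groups() for m in NAME_RE.finditer(txt)]
for row in rows:
    print(row)
```

When there is no middle part, the second group comes back as an empty string rather than None, because the \w* alternative is allowed to match nothing.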
Upvotes: 1
Reputation: 1087
I am using the Given, Middle and Family as the tag names for each match, how do I access the data using such tags and not just m.group(i)
You can use m.group('Given'), m.group('Middle'), m.group('Family')
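A minimal sketch of that, in Python 3 syntax and reusing the pattern from the question. The variable name text is mine, chosen to avoid shadowing the built-in str.

```python
import re

p = re.compile(r'^(?P<Given>\w+) (?P<Middle>\w\.) (?P<Family>\w+)$', re.MULTILINE)
text = "Jack A. Smith\nMary B. Miller"

m = p.match(text)

# Named access and numbered access refer to the same groups.
given = m.group('Given')
middle = m.group('Middle')
family = m.group('Family')
same_as_index = (given == m.group(1))

# groupdict() returns every named group at once.
as_dict = m.groupdict()

print(given, middle, family)
print(as_dict)
```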
Let us say I want to do match and replace? I.e., I want to match Mary B. Miller, and replace by Jane M. Goldstein, such that the replaced string will now be: str = "Jack A. Smith\nJane M. Goldstein". How'd I go to do that?
re.sub() can be used for search and replace, as far as I know.
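For illustration, a small sketch of how re.sub() could do this replacement (Python 3 syntax; the variable names are mine). It uses re.escape() so that the dot in the name is treated as a literal character instead of a wildcard.

```python
import re

text = "Jack A. Smith\nMary B. Miller"
old_name = "Mary B. Miller"
new_name = "Jane M. Goldstein"

# re.escape() escapes regex metacharacters such as "." in the search text,
# so only the literal name is matched.
result = re.sub(re.escape(old_name), new_name, text)
print(result)
```

For a purely literal replacement like this one, text.replace(old_name, new_name) would also work; re.sub() becomes useful once the thing being replaced is a pattern.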
Upvotes: 1
Reputation: 39355
Copied from the re.match() documentation:
Note that even in MULTILINE mode, re.match() will only match
at the beginning of the string and not at the beginning of each line
That's why you are getting only the first match. If you need all the matches, use re.findall().
Wrapping your whole regex inside an extra () also returns the full matched line alongside the individual groups.
Here is an example:
p = re.compile(r'^((?P<Given>\w+) (?P<Middle>\w\.) (?P<Family>\w+))$', re.MULTILINE)
str = "Jack A. Smith\nMary B. Miller"
print re.findall(p, str)
Output:
[('Jack A. Smith', 'Jack', 'A.', 'Smith'), ('Mary B. Miller', 'Mary', 'B.', 'Miller')]
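The reason for the extra outer group is that re.findall() changes what it returns depending on how many capturing groups the pattern has. A short sketch in Python 3 syntax (variable names are mine) comparing a pattern without groups to one with groups:

```python
import re

text = "Jack A. Smith\nMary B. Miller"

# No capturing groups: findall returns the whole matches as strings.
no_groups = re.findall(r'^\w+ \w\. \w+$', text, re.MULTILINE)

# Several capturing groups: findall returns one tuple of groups per match,
# and the whole match is not included unless it has its own group.
with_groups = re.findall(r'^(\w+) (\w\.) (\w+)$', text, re.MULTILINE)

print(no_groups)
print(with_groups)
```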
UPDATE:
About your question-2: Use re.finditer() for this. An example:
p = re.compile(r'^(?P<FullName>(?P<Given>\w+) (?P<Middle>\w\.) (?P<Family>\w+))$', re.MULTILINE)
str = "Jack A. Smith\nMary B. Miller"
matches = re.finditer(p, str)
for match in matches:
    info = match.groupdict()  ## pulling out the match as a dictionary
    print info
    print info['Family']
Question-3:
Using the re.sub() will be sufficient for this replacement.
print re.sub(r"Mary B\. Miller", "Jane M. Goldstein", str)
## notice I have escaped the . with \.
## in regex an unescaped . matches any character except a newline.
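To see why escaping the dot matters, here is a small sketch in Python 3 syntax. The sample string is made up for the demonstration.

```python
import re

# Made-up sample containing one real match and one near miss.
sample = "Mary B. Miller and Mary Bx Miller"

# Unescaped ".": matches any single character except a newline.
unescaped = re.findall(r"Mary B. Miller", sample)

# Escaped "\.": matches only a literal dot.
escaped = re.findall(r"Mary B\. Miller", sample)

# By default "." does not match a newline.
across_newline = re.findall(r"a.b", "a\nb")

print(unescaped)
print(escaped)
print(across_newline)
```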
Upvotes: 1
Reputation: 28653
From the documentation of re module:
Note that even in MULTILINE mode, re.match() will only match at the beginning of the string and not at the beginning of each line.
You can use re.findall or re.finditer to find all matches:
>>> for match in p.finditer(str):
...     print match.groups()
...
('Jack', 'A.', 'Smith')
('Mary', 'B.', 'Miller')
To use the names of the groups instead of indexes, pass the group name you defined in the pattern:
>>> for match in p.finditer(str):
...     print match.group('Given')
...
Jack
Mary
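As a side-by-side check of the documentation note quoted in this answer, the following sketch (Python 3 syntax; the second sample string and all variable names are mine) shows match() being tied to the start of the string while search() and finditer() honour MULTILINE line starts.

```python
import re

p = re.compile(r'^(?P<Given>\w+) (?P<Middle>\w\.) (?P<Family>\w+)$', re.MULTILINE)

text = "Jack A. Smith\nMary B. Miller"
# Made-up input whose first line is not a name in the expected shape.
text2 = "not a name\nMary B. Miller"

# match() only tries position 0 of the string, even with MULTILINE.
first_only = p.match(text).group(0)
no_match = p.match(text2)

# search() scans forward, so "^" can match at the start of the second line.
found_later = p.search(text2).group(0)

# finditer() yields every non-overlapping match.
all_given = [m.group('Given') for m in p.finditer(text)]

print(first_only, no_match, found_later, all_given)
```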
Upvotes: 1