Reputation: 602
I have a few thousand blocks of text which may or may not contain a date of death for the person in the record, which is always in the form:
(d. xxxxxxxxxxxxx)
That is, it starts with an opening parenthesis, followed by a d and a ., then some date text, and ends with a closing parenthesis.
I wrote the following code with a few test samples to test a Regex which I wrote:
import re
tests = ["Milt Jackson, vibraphone, piano, guitar, 1923 (d. October 9, 1999)", "Howard Johnson, alto sax, 1908 (d. December 28, 1991)","Sonny Greenwich, guitar, 1936", "Eiichi Hayashi, alto sax, 1960", "Yoshio Ikeda, bass, 1942", "Urs Leimgruber, saxophones, bass clarinet. 1952"]
for test in tests:
    m = re.match ("\(d.(.*)\)", test)
    if m:
        print(m.groups())
However it prints no results.
I've tested the Regex in an online Regex tester and it works for valid test input.
So, I guess my code is wrong. Can anyone suggest why, please?
Finally, what I want to extract is the date of death itself (not the parentheses and the d.). Any suggestions on how I could do that?
Upvotes: 0
Views: 57
Reputation:
re.match always matches from the start of the string. From the docs:
re.match(pattern, string, flags=0)
If zero or more characters at the beginning of string match the regular expression pattern, return a corresponding match object.
Emphasis mine.
You need to use re.search to have Python search for a pattern anywhere in the string:
>>> import re
>>> tests = ["Milt Jackson, vibraphone, piano, guitar, 1923 (d. October 9, 1999)", "Howard Johnson, alto sax, 1908 (d. December 28, 1991)","Sonny Greenwich, guitar, 1936", "Eiichi Hayashi, alto sax, 1960", "Yoshio Ikeda, bass, 1942", "Urs Leimgruber, saxophones, bass clarinet. 1952"]
>>>
>>> for test in tests:
...     m = re.search(r"\(d\.(.*)\)", test)
...     if m:
...         print(m.groups())
...
(' October 9, 1999',)
(' December 28, 1991',)
>>>
Also, in your pattern, I escaped the . after d so that it matches a literal period. Otherwise, the regex will match any character there (except a newline).
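To get only the date text, without the leading space shown in the output above, one option is to consume the whitespace after the period outside the capture group and make the group non-greedy so it stops at the first closing parenthesis. This is a minimal sketch, not the only way to do it; the names DEATH_DATE and extract_death_date are made up for illustration and do not come from the question.

```python
import re

# Hypothetical names, introduced only for this illustration.
# \(d\.   literal "(d."
# \s*     optional whitespace, kept outside the capture group
# (.*?)   the date text, non-greedy so it stops at the first ")"
DEATH_DATE = re.compile(r"\(d\.\s*(.*?)\)")

def extract_death_date(text):
    """Return the date-of-death text, or None if the record has none."""
    m = DEATH_DATE.search(text)
    return m.group(1) if m else None

print(extract_death_date("Milt Jackson, vibraphone, piano, guitar, 1923 (d. October 9, 1999)"))
print(extract_death_date("Sonny Greenwich, guitar, 1936"))
```

Using m.group(1) returns the captured string directly, whereas m.groups() returns a tuple of all groups.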
Upvotes: 3
Reputation: 180391
Since the date is always in the form (d. xxxxxxxxxxxxx), and your regex and the other answers match anything of the form (d. followed by any text, you can do this without a regex, unless you expect cases where (d. is followed by a space and there is no closing parenthesis:
tests = ["Milt Jackson, vibraphone, piano, guitar, 1923 (d. October 9, 1999)", "Howard Johnson, alto sax, 1908 (d. December 28, 1991)","Sonny Greenwich, guitar, 1936", "Eiichi Hayashi, alto sax, 1960", "Yoshio Ikeda, bass, 1942", "Urs Leimgruber, saxophones, bass clarinet. 1952"]
for line in tests:
    # Check for the same marker we split on, so [1] always exists
    if "(d. " in line:
        spl = line.split("(d. ")[1]
        print(spl[:spl.find(")")])
October 9, 1999
December 28, 1991
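The same no-regex idea can also be wrapped in a function built on str.partition, which returns empty strings instead of raising when the separator is missing. This is only a sketch: it assumes the marker is exactly "(d. " with a single space, and the function name is hypothetical.

```python
def death_date_without_regex(line):
    """Return the text between "(d. " and the next ")", or None."""
    # partition returns (before, separator, after); the separator is
    # an empty string when the marker is not found.
    _, marker, rest = line.partition("(d. ")
    if not marker:
        return None
    date, closing, _ = rest.partition(")")
    # Treat a missing closing parenthesis as "no date found".
    return date if closing else None

print(death_date_without_regex("Howard Johnson, alto sax, 1908 (d. December 28, 1991)"))
print(death_date_without_regex("Yoshio Ikeda, bass, 1942"))
```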
Upvotes: 0
Reputation: 26667
Use search instead of match:
>>> for test in tests:
...     m = re.search(r"\(d\.(.*)\)", test)
...     if m:
...         print(m.groups())
...
(' October 9, 1999',)
(' December 28, 1991',)
Why won't match work?
match looks for the pattern only at the start of the string. In the test strings, the matching part is not at the start, so match fails, whereas search looks for the pattern anywhere in the string.
re.search(pattern, string, flags=0)
Scan through string looking for the first location where the regular expression pattern produces a match, and return a corresponding MatchObject instance. Return None if no position in the string matches the pattern;
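The difference is easy to see side by side. A small sketch using one of the test strings from the question; the variable names record and pattern are chosen here for illustration:

```python
import re

record = "Howard Johnson, alto sax, 1908 (d. December 28, 1991)"
pattern = r"\(d\.(.*)\)"

# match only tries at position 0, where the record starts with "Howard"
at_start = re.match(pattern, record)

# search scans forward until the pattern fits somewhere in the string
anywhere = re.search(pattern, record)

# match succeeds only when the string itself begins with "(d."
bare = re.match(pattern, "(d. December 28, 1991)")

print(at_start)
print(anywhere.group(1))
print(bare.group(1))
```

Here at_start is None, while the other two calls both capture the date text, including its leading space.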
Upvotes: 1