Watty62

Reputation: 602

Finding the contents of parentheses

I have a few thousand blocks of text which may or may not contain a date of death for the person in the record, which is always in the form:

(d. xxxxxxxxxxxxx)

That is, it starts with an opening parenthesis, followed by a d and a period, then some date text, and ends with a closing parenthesis.

I wrote the following code, with a few test samples, to try out a regex I wrote:

import re
tests = ["Milt Jackson, vibraphone, piano, guitar, 1923 (d. October 9, 1999)", "Howard Johnson, alto sax, 1908 (d. December 28, 1991)","Sonny Greenwich, guitar, 1936", "Eiichi Hayashi, alto sax, 1960", "Yoshio Ikeda, bass, 1942", "Urs Leimgruber, saxophones, bass clarinet. 1952"]

for test in tests:
    m = re.match ("\(d.(.*)\)", test)
    if m:
        print(m.groups())

However, it prints no results.

I've tested the Regex in an online Regex tester and it works for valid test input.

So, I guess my code is wrong. Can anyone suggest why, please?

Finally, what I want to extract is the date of death itself (not the parentheses and the d.). Any suggestions on how I could do that?

Upvotes: 0

Views: 57

Answers (3)

user2555451

Reputation:

re.match always matches from the start of the string. From the docs:

re.match(pattern, string, flags=0)

If zero or more characters at the beginning of string match the regular expression pattern, return a corresponding match object.

The key phrase there is "at the beginning of string".

You need to use re.search to have Python search for a pattern anywhere in the string:

>>> import re
>>> tests = ["Milt Jackson, vibraphone, piano, guitar, 1923 (d. October 9, 1999)", "Howard Johnson, alto sax, 1908 (d. December 28, 1991)","Sonny Greenwich, guitar, 1936", "Eiichi Hayashi, alto sax, 1960", "Yoshio Ikeda, bass, 1942", "Urs Leimgruber, saxophones, bass clarinet. 1952"]
>>>
>>> for test in tests:
...     m = re.search(r"\(d\.(.*)\)", test)
...     if m:
...         print(m.groups())
...
(' October 9, 1999',)
(' December 28, 1991',)
>>>

Also, in your pattern, I escaped the . after d to have Python match a literal period. Otherwise, Python will match any character there (except a newline).
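
To make the effect of the escaped period concrete, and to capture only the date without the leading space, here is a minimal sketch. The helper name extract_death_date and the tightened pattern (optional whitespace after "d.", then everything up to the next closing parenthesis) are illustrative assumptions, not part of the question's code.

```python
import re

# Assumed pattern: literal "(d.", optional whitespace, then every
# character up to (but not including) the next ")".
DEATH_DATE = re.compile(r"\(d\.\s*([^)]*)\)")


def extract_death_date(text):
    """Return the date text inside "(d. ...)", or None if it is absent."""
    m = DEATH_DATE.search(text)
    return m.group(1).strip() if m else None


# An unescaped "." matches any character, so "(dx ...)" is accepted
# by the loose pattern but rejected by the strict one.
sample = "Someone, 1900 (dx not a date)"
loose = re.search(r"\(d.(.*)\)", sample)
strict = re.search(r"\(d\.(.*)\)", sample)

print(extract_death_date("Milt Jackson, vibraphone, piano, guitar, 1923 (d. October 9, 1999)"))
print(extract_death_date("Sonny Greenwich, guitar, 1936"))
print(loose is not None, strict is not None)
```

Using [^)]* instead of .* stops the capture at the first closing parenthesis, which matters if a record ever contains more than one parenthesised note.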

Upvotes: 3

Padraic Cunningham

Reputation: 180391

Since the date is always in the form (d. xxxxxxxxxxxxx), and your regex and the other answers match anything of the form "(d." followed by any text, you can do this without a regex, unless you expect cases where "(d. " appears with no closing parenthesis:

tests = ["Milt Jackson, vibraphone, piano, guitar, 1923 (d. October 9, 1999)", "Howard Johnson, alto sax, 1908 (d. December 28, 1991)","Sonny Greenwich, guitar, 1936", "Eiichi Hayashi, alto sax, 1960", "Yoshio Ikeda, bass, 1942", "Urs Leimgruber, saxophones, bass clarinet. 1952"]
for line in tests:
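    if "(d. " in line:  # test for the same marker that split() uses
        # Keep the text after "(d. ", then cut at the closing parenthesis.
        spl = line.split("(d. ")[1]
        print(spl[:spl.find(")")])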

October 9, 1999
December 28, 1991
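
A variant of the no-regex approach can be sketched with str.partition, so that a record with no closing parenthesis yields no result instead of a truncated one. The function name death_date_without_regex is hypothetical, and returning None for malformed records is an assumption about what is wanted.

```python
def death_date_without_regex(line):
    """Return the text between "(d." and the next ")", or None."""
    # partition() splits on the first occurrence and returns a 3-tuple;
    # the middle element is empty when the separator is not found.
    _, marker, rest = line.partition("(d.")
    if not marker:
        return None  # no "(d." in this record
    date, closing, _ = rest.partition(")")
    if not closing:
        return None  # assumed policy: an unclosed "(d." counts as no date
    return date.strip()


print(death_date_without_regex("Howard Johnson, alto sax, 1908 (d. December 28, 1991)"))
print(death_date_without_regex("Yoshio Ikeda, bass, 1942"))
```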

Upvotes: 0

nu11p01n73R

Reputation: 26667

Use search instead of match

>>> for test in tests:
...     m = re.search(r"\(d\.(.*)\)", test)
...     if m:
...         print(m.groups())
...
(' October 9, 1999',)
(' December 28, 1991',)

Why won't match work?

match only looks for the pattern at the start of the string. In the test strings, the part that should match is not at the start, so match fails, whereas search looks for the pattern anywhere in the string.

  • re.search(pattern, string, flags=0)

    Scan through string looking for the first location where the regular expression pattern produces a match, and return a corresponding MatchObject instance. Return None if no position in the string matches the pattern;
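
The difference can be checked directly on one of the test strings. This is a small illustrative sketch; the variable names are assumptions made for the example.

```python
import re

pattern = r"\(d\.(.*)\)"
text = "Howard Johnson, alto sax, 1908 (d. December 28, 1991)"

at_start = re.match(pattern, text)   # only tries to match at index 0
anywhere = re.search(pattern, text)  # scans for the first matching position

print(at_start)                # None, because the string starts with "Howard"
print(repr(anywhere.group(1)))  # the captured date text, leading space included

# match() does succeed when the pattern begins at the very first character.
print(re.match(pattern, "(d. December 28, 1991)") is not None)
```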

Upvotes: 1
