Chubaka

Reputation: 3155

match multiple OR conditions in python 3 regex findall

In Python 3:

This is the Office of Foreign Assets Control (OFAC) list of individuals whose assets should be monitored:

https://www.treasury.gov/ofac/downloads/sdn.csv

Many of the dates of birth (the very last column of this comma-delimited file) look like

DOB 23 Jun 1959; alt. DOB 23 Jun 1958

or

DOB 1959; alt. DOB 1958

I am trying to capture all the birthdates after the keywords "DOB" and "alt. DOB" with the following code:

    if len(x.split(';')) > 0:
        if len(re.findall('DOB (.*)', x.split(';')[0])) > 0:
            new = re.findall('DOB | alt. DOB (.*)', x.split(';')[0])[0]
            print(new)

            try:
                print(datetime.strptime(new, '%d %b %Y'))
                return datetime.strptime(new, '%d %b %Y')
            except:
                return None

But this code only gets the birthdate right after "DOB" and does not include the date of birth after "alt. DOB". How can I capture both? Thank you.

Upvotes: 1

Views: 1566

Answers (2)

The fourth bird

Reputation: 163467

You could match DOB and use a capturing group for the date part. Because "alt. DOB" also contains "DOB", one pattern covers both cases. In the date part, the day and the month are optional and are followed by a required 4-digit year.

The date part pattern does not validate the date itself; it only makes the match a bit more specific.

\bDOB ((?:(?:3[01]|[12][0-9]|0?[1-9]) [A-Za-z]+ )?\d{4})\b

Explanation

  • \bDOB Match DOB and a space literally, preceded by a word boundary
  • ( Capture group 1
    • (?: Non capture group
      • (?:3[01]|[12][0-9]|0?[1-9]) [A-Za-z]+ Match a day number 1-31, a space, 1+ chars A-Za-z, and a space
    • )? Close group and make it optional
    • \d{4} Match 4 digits
  • )\b Close group 1 followed by a word boundary
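The same breakdown can be written directly into the pattern with re.VERBOSE, which may be easier to maintain. This is only a sketch: the name DOB_PATTERN is made up for illustration, and the spaces are written as [ ] because verbose mode ignores unescaped whitespace outside character classes.

```python
import re

# DOB_PATTERN is a hypothetical name; the pattern is the one explained above.
DOB_PATTERN = re.compile(
    r"""
    \bDOB[ ]                          # literal DOB at a word boundary, then a space
    (                                 # group 1: the date text
      (?:                             # optional day and month prefix
        (?:3[01]|[12][0-9]|0?[1-9])   # day number 1-31
        [ ][A-Za-z]+[ ]               # space, month name, space
      )?
      \d{4}                           # four-digit year
    )\b                               # end of group 1, then a word boundary
    """,
    re.VERBOSE,
)

print(DOB_PATTERN.findall("DOB 23 Jun 1959; alt. DOB 23 Jun 1958"))
print(DOB_PATTERN.findall("DOB 1959; alt. DOB 1958"))
```

The comments are part of the pattern string but are ignored by the regex engine, so the behaviour should match the one-line version.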


For example:

import re

regex = r"\bDOB ((?:(?:3[01]|[12][0-9]|0?[1-9]) [A-Za-z]+ )?\d{4})\b"
test_str = ("DOB 23 Jun 1959; alt. DOB 23 Jun 1958\n"
    "DOB 1959; alt. DOB 1958")

print(re.findall(regex, test_str))

Output

['23 Jun 1959', '23 Jun 1958', '1959', '1958']
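Since the question goes on to call datetime.strptime, here is a rough sketch of how the captured strings might be converted. The helper name parse_dobs is hypothetical, and mapping a year-only value to 1 January of that year is an assumption (that is simply what strptime does with '%Y'); you may prefer to keep year-only values as plain integers.

```python
import re
from datetime import datetime

DOB_RE = re.compile(r"\bDOB ((?:(?:3[01]|[12][0-9]|0?[1-9]) [A-Za-z]+ )?\d{4})\b")

def parse_dobs(text):
    """Return a datetime for every DOB / alt. DOB value found in text.

    Hypothetical helper: values that fit neither format are skipped.
    """
    results = []
    for raw in DOB_RE.findall(text):
        # Try the full date first, then fall back to a bare year.
        for fmt in ("%d %b %Y", "%Y"):
            try:
                results.append(datetime.strptime(raw, fmt))
                break
            except ValueError:
                continue
    return results

print(parse_dobs("DOB 23 Jun 1959; alt. DOB 23 Jun 1958"))
print(parse_dobs("DOB 1959; alt. DOB 1958"))
```

Note that '%b' depends on the current locale, so abbreviated English month names are assumed here.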

Upvotes: 1

moys

Reputation: 8033

You can use (?<=DOB\s)[\sa-zA-Z0-9]+

   (?<=DOB\s)  = Positive look-behind assertion. The match must be immediately preceded by the letters DOB followed by a whitespace character
   [\sa-zA-Z0-9]+ = Match one or more whitespace characters, letters, or digits, so the match stops at the ";"

Example:

import re

items=['DOB 23 Jun 1959; alt. DOB 23 Jun 1958', 'DOB 1959; alt. DOB 1958']
for item in items:
    print(re.findall(r'(?<=DOB\s)[\sa-zA-Z0-9]+',item))

Output

['23 Jun 1959', '23 Jun 1958']
['1959', '1958']
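It may also help to see why the pattern in the question misses dates. The alternation `|` has the lowest precedence, so 'DOB | alt. DOB (.*)' means either "DOB " with no group, or " alt. DOB " followed by the group. When the first branch matches, findall reports the unset group as an empty string. The sketch below shows this, plus one possible fix that makes the "alt. " prefix optional; the variable names are only for illustration.

```python
import re

text = "DOB 23 Jun 1959; alt. DOB 23 Jun 1958"

# Pattern from the question: the group belongs only to the second branch.
question_result = re.findall(r"DOB | alt. DOB (.*)", text)
print(question_result)   # the first entry is an empty string

# On the piece before the first ';', only the group-less branch can match.
first_part_result = re.findall(r"DOB | alt. DOB (.*)", text.split(";")[0])
print(first_part_result)

# One possible fix: optional "alt. " prefix, then capture up to the next ';'.
fixed_result = re.findall(r"(?:alt\. )?DOB ([^;]*)", text)
print(fixed_result)
```

Capturing with [^;]* is looser than the date-shaped patterns in the answers, so it is worth validating the captured text afterwards.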

Upvotes: 1
