Reputation: 3155
In Python 3:
This is the Office of Foreign Assets Control (OFAC) list of individuals whose assets should be monitored:
https://www.treasury.gov/ofac/downloads/sdn.csv
Many of the dates of birth (in the very last column of the comma-delimited file) look like
DOB 23 Jun 1959; alt. DOB 23 Jun 1958
or
DOB 1959; alt. DOB 1958
I am trying to capture all the birthdates after the keywords "DOB" and "alt. DOB" with the following code:
if len(x.split(';')) > 0:
    if len(re.findall('DOB (.*)', x.split(';')[0])) > 0:
        new = re.findall('DOB | alt. DOB (.*)', x.split(';')[0])[0]
        print(new)
        try:
            print(datetime.strptime(new, '%d %b %Y'))
            return datetime.strptime(new, '%d %b %Y')
        except:
            return None
But this code only gets the birthdate right after "DOB" and does not include the date of birth after "alt. DOB". How can I capture both? Thank you.
Upvotes: 1
Views: 1566
Reputation: 163467
You could match DOB and use a capturing group for the date part. In the date part, the day and the month can be optional, followed by 4 digits for the year.
The pattern does not validate the date itself; it only makes the match a bit more specific.
\bDOB ((?:(?:3[01]|[12][0-9]|0?[1-9]) [A-Za-z]+ )?\d{4})\b
Explanation
\bDOB
    Match DOB and a space literally, preceded by a word boundary
(
    Capture group 1
(?:
    Non-capture group
(?:3[01]|[12][0-9]|0?[1-9]) [A-Za-z]+
    Match a day number 1-31 (optional leading zero), a space, 1+ chars A-Za-z and a space
)?
    Close the non-capture group and make it optional
\d{4}
    Match 4 digits
)\b
    Close group 1, followed by a word boundary
For example:
import re
regex = r"\bDOB ((?:(?:3[01]|[12][0-9]|0?[1-9]) [A-Za-z]+ )?\d{4})\b"
test_str = ("DOB 23 Jun 1959; alt. DOB 23 Jun 1958\n"
"DOB 1959; alt. DOB 1958")
print(re.findall(regex, test_str))
Output
['23 Jun 1959', '23 Jun 1958', '1959', '1958']
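To get from the matched strings to actual dates, as the question's code attempts with datetime.strptime, each captured value can be tried against more than one format. This is only a minimal sketch: the helper names parse_dob and extract_dobs are made up for illustration, and it assumes that a year-only value may be represented as 1 January of that year and that the month abbreviations are English (%b is locale-dependent).

```python
import re
from datetime import datetime

DOB_RE = re.compile(r"\bDOB ((?:(?:3[01]|[12][0-9]|0?[1-9]) [A-Za-z]+ )?\d{4})\b")

def parse_dob(text):
    """Try the full-date format first, then year-only; return None if neither fits."""
    for fmt in ("%d %b %Y", "%Y"):
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    return None

def extract_dobs(field):
    """Return one datetime (or None) per DOB / alt. DOB entry found in the field."""
    return [parse_dob(match) for match in DOB_RE.findall(field)]

print(extract_dobs("DOB 23 Jun 1959; alt. DOB 23 Jun 1958"))
print(extract_dobs("DOB 1959; alt. DOB 1958"))
```

Catching ValueError specifically, rather than using a bare except, keeps unrelated errors visible while still returning None for values that match neither format.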
Upvotes: 1
Reputation: 8033
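You can use (?<=DOB\s)[\s[a-zA-Z0-9]+]*
(?<=DOB\s) = Positive look-behind assertion. This matches the string to its right only if that string is preceded by the letters DOB followed by a whitespace character
[\s[a-zA-Z0-9]+]* = Match one or more whitespace characters, letters or digits. The inner [ is just another literal member of the character class and the trailing ]* matches zero or more literal ] characters, so on this data it behaves like [\sa-zA-Z0-9]+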
Example:
import re
items = ['DOB 23 Jun 1959; alt. DOB 23 Jun 1958', 'DOB 1959; alt. DOB 1958']
for item in items:
    print(re.findall(r'(?<=DOB\s)[\s[a-zA-Z0-9]+]*', item))
Output
['23 Jun 1959', '23 Jun 1958']
['1959', '1958']
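As a quick check of how the character class in this pattern is read, the sketch below compares it with a flattened version on the same two sample strings. The names original and simplified are only for illustration; the simplified pattern is a suggested alternative, not part of the answer above.

```python
import re

items = ["DOB 23 Jun 1959; alt. DOB 23 Jun 1958", "DOB 1959; alt. DOB 1958"]

# Pattern as given: the class contains whitespace, a literal '[', letters and digits,
# followed by zero or more literal ']' characters.
original = re.compile(r"(?<=DOB\s)[\s[a-zA-Z0-9]+]*")

# Flattened class without the stray brackets.
simplified = re.compile(r"(?<=DOB\s)[\sa-zA-Z0-9]+")

for item in items:
    # Both patterns stop at the ';' because it is not in the character class.
    print(original.findall(item), simplified.findall(item))
```

Because the look-behind only checks for DOB plus one whitespace character, either pattern would also pick up non-date text that follows DOB, so the stricter date pattern may be preferable if the column contains other wording.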
Upvotes: 1