Reputation: 2650
I want to extract the first date in format yyyy-mm-dd from a dataframe (Pandas). When no date is found, just return an empty string. These are some example of the data stored in the dataframe.
1976-05-17 [ ]
[ ] 1976-05-172
1976-05-17
1976-05-17 Atlanta, Georgia U.S.
1976-05-17 1975-07-11
( 1976-05-17 ) 1976-05-17 (age 38) [ ]
In all cases I want 1976-05-17 or an empty string.
The result would be running a regular expression on a DataFrame and add the result to a new column
Upvotes: 2
Views: 1800
Reputation: 180391
To get the first use search
which will stop at the first matched substring:
r = re.compile("\d{4}-\d{2}-\d{2}")
Using your sample:
lines = """1976-05-17 [ ]
[ ] 1976-05-172
1976-05-17
1976-05-17 Atlanta, Georgia U.S.
1976-05-17 1975-07-11
( 1976-05-17 ) 1976-05-17 (age 38) [ ]"""
r = re.compile("\d{4}-\d{2}-\d{2}")
for line in lines.splitlines():
m = r.search(line)
if m:
print(m.group())
Output:
1976-05-17
1976-05-17
1976-05-17
1976-05-17
1976-05-17
1976-05-17
If you are applying it to a df, you can test if there is a match if so use the match or else use an empty string as the value i.e.
import pandas as pd
df = pd.read_csv("test.txt")
print(df)
def match(x):
m = r.search(x)
if m:
return m.group()
return ""
Output:
print(df)
print df["date"].apply(match)
date
0 1976-05-17 [ ]
1 [ ] 1976-05-172
2 1976-05-17
3 1976-05-17 Atlanta, Georgia U.S.
4 1976-05-17 1975-07-11
5 ( 1976-05-17 ) 1976-05-17 (age 38) [ ]
0 1976-05-17
1 1976-05-17
2 1976-05-17
3 1976-05-17
4 1976-05-17
5 1976-05-17
Name: date, dtype: object
You could also set the column equal to the return value of str.extract
if you were ok with Nan for non matches:
print df["date"].str.extract(r"(\d{4}-\d{2}-\d{2})")
Adding "foo"
to the column:
0 1976-05-17
1 1976-05-17
2 1976-05-17
3 1976-05-17
4 1976-05-17
5 1976-05-17
6 NaN
Name: date, dtype: object
Upvotes: 3
Reputation: 5600
You can use regex to find dates in a string:
import re
line = '1976-05-17 Atlanta, Georgia U.S.'
dates = re.findall('[0-9]{4}-[0-9]{2}-[0-9]{2}', line)
date = dates[0] if len(dates) else None
Upvotes: 0