oscarm
oscarm

Reputation: 2650

Extract first date from string in a data frame

I want to extract the first date in format yyyy-mm-dd from a dataframe (Pandas). When no date is found, just return an empty string. These are some example of the data stored in the dataframe.

1976-05-17 [ ]
[ ] 1976-05-172 
1976-05-17       
1976-05-17 Atlanta, Georgia U.S.  
1976-05-17 1975-07-11
( 1976-05-17 ) 1976-05-17 (age 38) [ ]

In all cases I want 1976-05-17 or an empty string.

The result would be running a regular expression on a DataFrame and add the result to a new column

Upvotes: 2

Views: 1800

Answers (2)

Padraic Cunningham
Padraic Cunningham

Reputation: 180391

To get the first use search which will stop at the first matched substring:

 r = re.compile("\d{4}-\d{2}-\d{2}")

Using your sample:

lines = """1976-05-17 [ ]
[ ] 1976-05-172
1976-05-17
1976-05-17 Atlanta, Georgia U.S.
1976-05-17 1975-07-11
( 1976-05-17 ) 1976-05-17 (age 38) [ ]"""
r = re.compile("\d{4}-\d{2}-\d{2}")
for line in lines.splitlines():
    m = r.search(line)
    if m:
        print(m.group())

Output:

1976-05-17
1976-05-17
1976-05-17
1976-05-17
1976-05-17
1976-05-17

If you are applying it to a df, you can test if there is a match if so use the match or else use an empty string as the value i.e.

import pandas as pd

df = pd.read_csv("test.txt")
print(df)
def match(x):
    m = r.search(x)
    if m:
        return  m.group()
    return  ""

Output:

print(df)
print df["date"].apply(match)

                                     date
0                          1976-05-17 [ ]
1                         [ ] 1976-05-172
2                              1976-05-17
3        1976-05-17 Atlanta, Georgia U.S.
4                   1976-05-17 1975-07-11
5  ( 1976-05-17 ) 1976-05-17 (age 38) [ ]


0    1976-05-17
1    1976-05-17
2    1976-05-17
3    1976-05-17
4    1976-05-17
5    1976-05-17
Name: date, dtype: object

You could also set the column equal to the return value of str.extract if you were ok with Nan for non matches:

print df["date"].str.extract(r"(\d{4}-\d{2}-\d{2})")

Adding "foo" to the column:

0    1976-05-17
1    1976-05-17
2    1976-05-17
3    1976-05-17
4    1976-05-17
5    1976-05-17
6           NaN
Name: date, dtype: object

Upvotes: 3

ahmed
ahmed

Reputation: 5600

You can use regex to find dates in a string:

import re

line = '1976-05-17 Atlanta, Georgia U.S.'
dates = re.findall('[0-9]{4}-[0-9]{2}-[0-9]{2}', line)

date = dates[0] if len(dates) else None

Upvotes: 0

Related Questions