Krishna
Krishna

Reputation: 167

Extract date from string in a pandas dataframe column

I am trying to extract date from a DF column containing strings and store in another column.

from dateutil.parser import parse
 
extract = parse("January 24, 1976", fuzzy_with_tokens=True)
print(str(extract[0]))

The above code extracts: 1976-01-24 00:00:00

I would like this to be done to all strings in a column in a DF.

The below is what I am trying but is not working:

df['Dates'] = df.apply(lambda x: parse(x['Column to extract'], fuzzy_with_tokens=True), axis=1)

Things to note:

  1. If there are multiple dates, need to join them with some delimiter
  2. There can be strings without date. In that case parser returns an error "ParserError: String does not contain a date". This needs to be handled.

Upvotes: 0

Views: 1350

Answers (2)

Ian Thompson
Ian Thompson

Reputation: 3285

See pd.to_datetime

It operates in a vectorized manner so can convert all dates quickly.

df["Dates"] = pd.to_datetime(df["Dates"])

If there are strings that won't convert to a datetime and you want them nullified, you can use errors="coerce"

df["Dates"] = pd.to_datetime(df["Dates"], errors="coerce")

NER with spacy

import spacy  # 3.4.2
from spacy import displacy


nlp = spacy.load("en_core_web_sm")

eg_txt = "today is january 26, 2016. Tomorrow is january 27, 2016"

doc = nlp(eg_txt)

displacy.render(doc, style="ent")

ents

We can apply the spacy logic to a dataframe

import pandas as pd  # 1.5.1


# some fake data
df = pd.DataFrame({
    "text": ["today is january 26, 2016. Tomorrow is january 27, 2016",
             "today is january 26, 2016.",
              "Tomorrow is january 27, 2016"]
})

# convert text to spacy docs
docs = nlp.pipe(df.text.to_numpy())

# unpack the generator into a series
doc_series = pd.Series(docs, index=df.index, name="docs")

df = df.join(doc_series)

# extract entities
df["entities"] = df.docs.apply(lambda x: x.ents)

# explode to one entity per row
df = df.explode(column="entities")

# build dictionary of ent type and ent text
df["entities"] = df.entities.apply(lambda ent: {ent.label_: ent.text})

# join back with df
df = df.join(df["entities"].apply(pd.Series))

# convert all DATE entities to datetime
df["dates"] = pd.to_datetime(df.DATE, errors="coerce")

# back to one row per original text and a container of datetimes
df = df.groupby("text").dates.unique().to_frame().reset_index()

print(df)
                                                text                                              dates
0                       Tomorrow is january 27, 2016               [NaT, 2016-01-27T00:00:00.000000000]
1                         today is january 26, 2016.  [2022-11-17T11:42:49.607705000, 2016-01-26T00:...
2  today is january 26, 2016. Tomorrow is january...  [2022-11-17T11:42:49.605705000, 2016-01-26T00:...

Upvotes: 1

Nuri Taş
Nuri Taş

Reputation: 3845

If you want to use parse, you may need a customized function to handle exceptions:

def parse_date(row):
    try:
        date = parse(row, fuzzy_with_tokens=True)
        return date[0]
    except:
        return np.nan


df['dates'] = df['Column to extract'].apply(lambda x: parse_date(x))

Upvotes: 0

Related Questions