user10
user10

Reputation: 167

Pandas dataframe parse string column to extract dates into new column

I have a dataframe containing a string column:

text = ['dasndljaksdhfinwejfiv 12/05/2018'', 'akdhaouiyfbh adv12.03.2019','faytdvi advonalsdnfoaidv 5/9/2019IUAYFNVVKNVAIUEHF']

I want to extract just the dates from this and add to a new column.

I have tried the following but it just returns the original string value of the column:

df['date'] = df.text.replace({r"\d{2}[\/ ](\d{2}|January|Jan)[\/ ]\d{2,4}"},regex=True)

What's the best way to get around this?

Upvotes: 1

Views: 2195

Answers (1)

Wiktor Stribiżew
Wiktor Stribiżew

Reputation: 626893

You may use .str.extract(), use \d{1,2} to match 1-digit days and months and add the . to the delimiters pattern:

df['date'] = df['text'].str.extract(r"(\d{1,2}[/. ](?:\d{1,2}|January|Jan)[/. ]\d{2}(?:\d{2})?)")

See the regex demo.

Also, note that \d{2}(?:\d{2})? matches 2 or 4 digits, and \d{2,4} matches 2, 3 or 4 digits.

Note that the whole regex pattern is wrapped with a capturing group, it is necessary because .str.extract requires at least one capturing group to work, it will yield just the text captured.

To match the dates not inside other digits, you may add (?<!\d) at the start and (?!\d) at the end, it may make the pattern safer.

If you plan to match any English month name you need to expand the pattern a little:

r"(?<!\d)(\d{1,2}[/. ](?:\d{1,2}|(?:J(?:an(?:uary)?|u(?:ne?|ly?))|Feb(?:ruary)?|Ma(?:r(?:ch)?|y)|A(?:pr(?:il)?|ug(?:ust)?)|Sep(?:t(?:ember)?)?|(?:Nov|Dec)(?:ember)?|Oct(?:ober)?))[/. ]\d{2}(?:\d{2})?)(?!\d)"

See another regex demo.

Details

  • (?<!\d) - no digit immediately to the left is allowed
  • ( - start of a capturing group
  • \d{1,2} - 1 or 2 digits
  • [/. ] - space, / or .
  • (?:\d{1,2}|(?:J(?:an(?:uary)?|u(?:ne?|ly?))|Feb(?:ruary)?|Ma(?:r(?:ch)?|y)|A(?:pr(?:il)?|ug(?:ust)?)|Sep(?:t(?:ember)?)?|(?:Nov|Dec)(?:ember)?|Oct(?:ober)?)) - month name, abbreviated or not
  • [/. ] - space, / or .
  • \d{2}(?:\d{2})? - 2 digits followed with an optional sequence of two digits
  • ) - end of the capturing group
  • (?!\d) - no digit immediately to the right is allowed

Upvotes: 3

Related Questions