Zephyr

Reputation: 1352

Date and time column has mixed format

I am having trouble formatting a date and time column. My data file has a column that mixes dates, times, and other text. Below is a sample that represents part of my data.

import pandas as pd

data = pd.DataFrame()
data['Date'] = ['01 Jul 2014 - Qualification','30 Sep 2014 - Group Stage','17 Mar 2015 - Play Offs',' 19:00:00']
data['ID'] = [1,2,3,4]

I created a new column and tried to convert it with pd.to_datetime as follows:

data['date1'] = pd.to_datetime(data['Date'], errors='coerce')

Every value in date1 came out as NaT. I also want to create two more columns, Time and stage, to hold the time and the game stage.

How can I do this?

Upvotes: 0

Views: 317

Answers (2)

jezrael

Reputation: 862661

You can use regex here with Series.str.extract:

# Option 1, from https://stackoverflow.com/a/47656743
pat = r'(\d+/\d+(?:/\d+)?|(?:\d+ )?(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|June?|July?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)[.,]?(?:-\d+-\d+| \d+(?:th|rd|st|nd)?,? \d+| \d+)|\d{4})'

# Option 2, from https://stackoverflow.com/a/46069885
# This assignment replaces the one above, so it is the pattern actually used below.
pat = r'((?:\d{,2}\s)?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*(?:-|\.|\s|,)\s?\d{,2}[a-z]*(?:-|,|\s)?\s?\d{2,4})'

# Pull out only the date-like substring, then convert it; rows with no match become NaT
s = data['Date'].str.extract(pat, expand=False)
data['date1'] = pd.to_datetime(s, errors='coerce')
print(data)
                          Date  ID      date1
0  01 Jul 2014 - Qualification   1 2014-07-01
1    30 Sep 2014 - Group Stage   2 2014-09-30
2      17 Mar 2015 - Play Offs   3 2015-03-17
3                     19:00:00   4        NaT
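
The question also asks for separate time and stage columns. Here is a minimal sketch of one way to get them with the same Series.str.extract approach. It assumes the stage is whatever follows the hyphen and that a time always looks like HH:MM:SS; both patterns are assumptions based on the four sample rows, not something the question confirms for the full file.

```python
import pandas as pd

data = pd.DataFrame({
    'Date': ['01 Jul 2014 - Qualification', '30 Sep 2014 - Group Stage',
             '17 Mar 2015 - Play Offs', ' 19:00:00'],
    'ID': [1, 2, 3, 4],
})

# Stage: the text after the first hyphen (assumed separator).
# Rows without a hyphen get NaN.
data['stage'] = data['Date'].str.extract(r'-\s*(.+)$', expand=False)

# Time: an HH:MM:SS token anywhere in the string (assumed format).
# Rows without such a token get NaN.
data['time'] = data['Date'].str.extract(r'(\d{1,2}:\d{2}:\d{2})', expand=False)

print(data)
```

Both new columns hold plain strings here; convert the time column further only if you need real time objects.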

Upvotes: 1

amanb

Reputation: 5463

The Date column contains text besides the date/time, so it cannot be converted to datetime as it is. You first need to separate the date/time part from the rest of the text. One way is to split on - with expand=True, which puts the date and the stage text into separate columns of a temporary DataFrame df_temp, and then assign those columns to your existing DataFrame:

In [27]: df_temp = data['Date'].str.split('-', expand=True)

In [28]: data['date1'] = df_temp[0]

In [29]: data['stage'] = df_temp[1]

In [30]: data
Out[30]:
                          Date  ID         date1           stage
0  01 Jul 2014 - Qualification   1  01 Jul 2014    Qualification
1    30 Sep 2014 - Group Stage   2  30 Sep 2014      Group Stage
2      17 Mar 2015 - Play Offs   3  17 Mar 2015        Play Offs
3                     19:00:00   4      19:00:00            None

In [31]: data['date1'] = pd.to_datetime(data.date1,errors = 'coerce')

In [32]: data
Out[32]:
                          Date  ID      date1           stage
0  01 Jul 2014 - Qualification   1 2014-07-01   Qualification
1    30 Sep 2014 - Group Stage   2 2014-09-30     Group Stage
2      17 Mar 2015 - Play Offs   3 2015-03-17       Play Offs
3                     19:00:00   4        NaT            None
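
Splitting on a bare - leaves the spaces around the hyphen attached to both pieces. The variant below is a rough sketch that splits only once, strips that whitespace, and passes an explicit format to pd.to_datetime. The name parts and the format string '%d %b %Y' are assumptions chosen to fit the sample rows.

```python
import pandas as pd

data = pd.DataFrame({
    'Date': ['01 Jul 2014 - Qualification', '30 Sep 2014 - Group Stage',
             '17 Mar 2015 - Play Offs', ' 19:00:00'],
    'ID': [1, 2, 3, 4],
})

# Split at the first hyphen only, so a hyphen inside a stage name would survive.
parts = data['Date'].str.split('-', n=1, expand=True)

# Strip the spaces left over from " - " before parsing or storing.
date_text = parts[0].str.strip()
data['stage'] = parts[1].str.strip()

# Explicit day / abbreviated month / year format (assumed from the sample);
# anything that does not fit, such as the bare time, becomes NaT.
data['date1'] = pd.to_datetime(date_text, format='%d %b %Y', errors='coerce')

print(data)
```

With an explicit format, rows that do not match fail predictably instead of depending on how the parser guesses.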

Upvotes: 1
