Zephyr

Reputation: 1352

How to handle irregular format from CSV using Pandas

I have been trying different ways to handle date time in Pandas from CSV.

I have 3 columns in a CSV file:

  1. kickoffDate
  2. kickoffTime
  3. time

The first two columns are well formatted, but the third column, "time", is mixed: some values are stored as a time and some as a full date-time.

For example:

12:00:00 AM
1/1/1900 9:04:00 PM

How do I convert these to a single format using pandas?

Secondly, I want to add all three columns together to get an event time.

For example:

kickoffDate = ['8/6/2017', '8/6/2017', '8/6/2017']
kickoffTime = ['15:00:00', '15:00:00', '15:00:00']
time = ['51:48:00', '86:05:00', '10:04']

Here, time is in mm:ss:00 format (minutes and seconds followed by a trailing :00), although some values, such as 10:04, are plain mm:ss. I want to combine the three columns into a new column called eventdatetime:

eventdatetime = [06-08-2017 15:51:48, 06-08-2017 16:26:05, 06-08-2017 15:10:04]

How can I do this? I am able to combine the first two columns with the following expression:

DateTime1 = data['kickoffDate'] + ' ' + data['kickoffTime']

The original CSV file can be downloaded from the link below:

https://drive.google.com/open?id=1JL65x7nq2m6zk4qnaRUDKL894aEdXW_B

Upvotes: 0

Views: 235

Answers (1)

jezrael

Reputation: 863166

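You can use the parse_dates parameter to parse the first and second columns together as datetimes, then convert the last column with to_timedelta. Values in mm:ss:00 form (two colons) first have the trailing :00 removed, and every value gets 00: prepended as the hours field: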

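import pandas as pd

df = pd.read_csv('Datetimetest.csv', parse_dates=[[0, 1]])

# True where the value has the mm:ss:00 form (two colons) rather than mm:ss
m = df['time'].str.count(':') != 1
# Strip the trailing ':00' from those values, then prepend '00:' as the hours field
df['time'] = pd.to_timedelta('00:' + df['time'].mask(m, df['time'].str.replace(r':00$', '', regex=True)))
df['eventdatetime'] = df['kickoffDate_kickoffTime'] + df['time']
print(df.head())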
  kickoffDate_kickoffTime     time       eventdatetime
0     2018-04-30 19:00:00 00:47:36 2018-04-30 19:47:36
1     2018-04-30 19:00:00 00:15:28 2018-04-30 19:15:28
2     2018-04-29 13:15:00 00:52:03 2018-04-29 14:07:03
3     2018-04-29 13:15:00 01:03:42 2018-04-29 14:18:42
4     2018-04-29 13:15:00 00:10:43 2018-04-29 13:25:43
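As far as I know, newer pandas releases (2.2 onward, I believe) deprecate the nested-list form of parse_dates, so it may be safer to read the columns as plain strings and combine them explicitly. A minimal sketch, assuming the dates are written month/day/year; the two inline rows are hypothetical and only modeled on the output above:

```python
import io
import pandas as pd

# Hypothetical rows standing in for Datetimetest.csv
csv_text = """kickoffDate,kickoffTime,time
4/30/2018,19:00:00,47:36:00
4/29/2018,13:15:00,10:43
"""

df = pd.read_csv(io.StringIO(csv_text))

# Join date and time as strings, then parse once with an explicit format.
# The month/day/year order is an assumption; swap %m and %d if the file is day-first.
df['kickoff'] = pd.to_datetime(
    df['kickoffDate'] + ' ' + df['kickoffTime'],
    format='%m/%d/%Y %H:%M:%S',
)
print(df[['kickoff', 'time']])
```

Passing an explicit format also avoids pandas guessing between day-first and month-first dates.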

Another solution, if you want four separate columns in the output: parse only the first column as datetimes and convert the other two to timedeltas:

df = pd.read_csv('Datetimetest.csv', parse_dates=[0])

# True where the value has the mm:ss:00 form (two colons) rather than mm:ss
m = df['time'].str.count(':') != 1
df['time'] = pd.to_timedelta('00:' + df['time'].mask(m, df['time'].str.replace(r':00$', '', regex=True)))
df['kickoffTime'] = pd.to_timedelta(df['kickoffTime'])
df['eventdatetime'] = df['kickoffDate'] + df['kickoffTime'] + df['time']
print(df.head())
  kickoffDate kickoffTime     time       eventdatetime
0  2018-04-30    19:00:00 00:47:36 2018-04-30 19:47:36
1  2018-04-30    19:00:00 00:15:28 2018-04-30 19:15:28
2  2018-04-29    13:15:00 00:52:03 2018-04-29 14:07:03
3  2018-04-29    13:15:00 01:03:42 2018-04-29 14:18:42
4  2018-04-29    13:15:00 00:10:43 2018-04-29 13:25:43
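If the string trick of prepending 00: feels fragile (it relies on to_timedelta accepting a minutes field above 59), the offset can also be built numerically. A rough sketch using the sample values from the question, assuming the first two colon-separated fields are always minutes and seconds and that the dates are month/day/year:

```python
import pandas as pd

# Sample values copied from the question
df = pd.DataFrame({
    'kickoffDate': ['8/6/2017', '8/6/2017', '8/6/2017'],
    'kickoffTime': ['15:00:00', '15:00:00', '15:00:00'],
    'time': ['51:48:00', '86:05:00', '10:04'],
})

# Split on ':' and keep the first two fields; a trailing ':00' is simply ignored
parts = df['time'].str.split(':', expand=True)
minutes = parts[0].astype(int)
seconds = parts[1].astype(int)
offset = pd.to_timedelta(minutes, unit='m') + pd.to_timedelta(seconds, unit='s')

# Month/day/year is an assumption about the file
kickoff = pd.to_datetime(
    df['kickoffDate'] + ' ' + df['kickoffTime'],
    format='%m/%d/%Y %H:%M:%S',
)
df['eventdatetime'] = kickoff + offset
print(df['eventdatetime'].dt.strftime('%d-%m-%Y %H:%M:%S').tolist())
```

With these inputs the formatted values match the eventdatetime list requested in the question.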

EDIT:

If the input data does not come from a CSV file, the first column can be converted with to_datetime instead of the parse_dates parameter of read_csv:

df = pd.read_csv('Datetimetest.csv')

# True where the value has the mm:ss:00 form (two colons) rather than mm:ss
m = df['time'].str.count(':') != 1
df['time'] = pd.to_timedelta('00:' + df['time'].mask(m, df['time'].str.replace(r':00$', '', regex=True)))

df['kickoffDate'] = pd.to_datetime(df['kickoffDate'])
df['kickoffTime'] = pd.to_timedelta(df['kickoffTime'])
df['eventdatetime'] = df['kickoffDate'] + df['kickoffTime'] + df['time']
print(df.head())
  kickoffDate kickoffTime     time       eventdatetime
0  2018-04-30    19:00:00 00:47:36 2018-04-30 19:47:36
1  2018-04-30    19:00:00 00:15:28 2018-04-30 19:15:28
2  2018-04-29    13:15:00 00:52:03 2018-04-29 14:07:03
3  2018-04-29    13:15:00 01:03:42 2018-04-29 14:18:42
4  2018-04-29    13:15:00 00:10:43 2018-04-29 13:25:43
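The question also shows values such as 12:00:00 AM and 1/1/1900 9:04:00 PM in the time column. The code above does not cover those. One possible way to bring them to a single HH:MM:SS form is sketched below, assuming the 1/1/1900 date part is only a placeholder that can be dropped; whether that is true for the real file is a guess:

```python
import pandas as pd

# The two example values from the question
raw = pd.Series(['12:00:00 AM', '1/1/1900 9:04:00 PM'])

# Keep only the trailing "H:MM:SS AM/PM" part, discarding any leading date
clock = raw.str.extract(r'(\d{1,2}:\d{2}:\d{2}\s*[AP]M)$', expand=False)

# Parse the 12-hour clock text and re-render it as a 24-hour string
parsed = pd.to_datetime(clock, format='%I:%M:%S %p')
normalized = parsed.dt.strftime('%H:%M:%S')
print(normalized.tolist())
```

Values that do not end in an AM/PM clock time come out of str.extract as missing, so they would need separate handling.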

Upvotes: 1
