Khan Hassan
Khan Hassan

Reputation: 71

transform raw date format into pandas date object

I have a CSV file which looks like this:

time, Numbers
[30/Apr/1998:21:30:17,24736
[30/Apr/1998:21:30:53,24736
[30/Apr/1998:21:31:12,24736
[30/Apr/1998:21:31:19,3781
[30/Apr/1998:21:31:22,-
[30/Apr/1998:21:31:27,24736
[30/Apr/1998:21:31:29,-
[30/Apr/1998:21:31:29,-
[30/Apr/1998:21:31:32,929
[30/Apr/1998:21:31:43,-
[30/Apr/1998:21:31:44,1139
[30/Apr/1998:21:31:52,24736
[30/Apr/1998:21:31:52,3029
[30/Apr/1998:21:32:06,24736
[30/Apr/1998:21:32:16,-
[30/Apr/1998:21:32:16,-
[30/Apr/1998:21:32:17,-
[30/Apr/1998:21:32:30,14521
[30/Apr/1998:21:32:33,11324
[30/Apr/1998:21:32:35,24736
[30/Apr/1998:21:32:3l8,671
[30/Apr/1998:21:32:38,1512
[30/Apr/1998:21:32:38,1136
[30/Apr/1998:21:32:38,1647
[30/Apr/1998:21:32:38,1271
[30/Apr/1998:21:32:52,5933
[30/Apr/1998:21:32:58,-
[30/Apr/1998:21:32:59,231
upto one billion,

forget about numbers column, I have a concern to convert this time-date format in my CSV file to pandas time stamp, so I can plot my dataset and visualize it according to time, as I am new in datascience,here is my approach:

step 1: take all the time colum from my CSV file into an array,
step 2: split the data from the mid where :(colon) occurs, make two new arrays of date and time,
step 3: remove "[" from date array,
step 4: replace all forward slash into dashes in the date array,
step 5: and then append date and time array to make a single pandas format,

which will be looks like this, 2017-03-22 15:16:45 as you known that I am new and my approach is naive and also wrong, if someone can help me with providing me code snippet, I will be really happy, thanks

Upvotes: 1

Views: 377

Answers (1)

Anton vBR
Anton vBR

Reputation: 18916

You can pass a format to pd.to_datetime(), in this case: [%d/%b/%Y:%H:%M:%S. Be careful with erroneous data though as seen in row 3 in sample data below ([30/Apr/1998:21:32:3l8,671). To not get an error you can pass errors=coerce, will return Not a Time (NaT).

The other way would be to replace those rows manually or write some sort of regex/replace funtion first.

import pandas as pd

data = '''\
time, Numbers
[30/Apr/1998:21:30:17,24736
[30/Apr/1998:21:30:53,24736
[30/Apr/1998:21:32:3l8,671
[30/Apr/1998:21:32:38,1512
[30/Apr/1998:21:32:38,1136       
[30/Apr/1998:21:32:58,-      
[30/Apr/1998:21:32:59,231'''

fileobj = pd.compat.StringIO(data)
df = pd.read_csv(fileobj, sep=',', na_values=['-'])

df['time'] = pd.to_datetime(df['time'], format='[%d/%b/%Y:%H:%M:%S', errors='coerce')
print(df)

Returns:

                 time   Numbers
0 1998-04-30 21:30:17   24736.0
1 1998-04-30 21:30:53   24736.0
2                 NaT     671.0
3 1998-04-30 21:32:38    1512.0
4 1998-04-30 21:32:38    1136.0
5 1998-04-30 21:32:58       NaN
6 1998-04-30 21:32:59     231.0

Note that: na_values=['-'] was used here to help pandas understand the Numbers column is actually numbers and not strings.


And now we can perform actions like grouping (on minute for instance):

print(df.groupby(df.time.dt.minute)['Numbers'].mean())

#time
#30.0    24736.000000
#32.0      959.666667

Upvotes: 2

Related Questions