Reputation: 71
I have a CSV file which looks like this:
time, Numbers
[30/Apr/1998:21:30:17,24736
[30/Apr/1998:21:30:53,24736
[30/Apr/1998:21:31:12,24736
[30/Apr/1998:21:31:19,3781
[30/Apr/1998:21:31:22,-
[30/Apr/1998:21:31:27,24736
[30/Apr/1998:21:31:29,-
[30/Apr/1998:21:31:29,-
[30/Apr/1998:21:31:32,929
[30/Apr/1998:21:31:43,-
[30/Apr/1998:21:31:44,1139
[30/Apr/1998:21:31:52,24736
[30/Apr/1998:21:31:52,3029
[30/Apr/1998:21:32:06,24736
[30/Apr/1998:21:32:16,-
[30/Apr/1998:21:32:16,-
[30/Apr/1998:21:32:17,-
[30/Apr/1998:21:32:30,14521
[30/Apr/1998:21:32:33,11324
[30/Apr/1998:21:32:35,24736
[30/Apr/1998:21:32:3l8,671
[30/Apr/1998:21:32:38,1512
[30/Apr/1998:21:32:38,1136
[30/Apr/1998:21:32:38,1647
[30/Apr/1998:21:32:38,1271
[30/Apr/1998:21:32:52,5933
[30/Apr/1998:21:32:58,-
[30/Apr/1998:21:32:59,231
upto one billion,
forget about numbers column, I have a concern to convert this time-date format in my CSV file to pandas time stamp, so I can plot my dataset and visualize it according to time, as I am new in datascience,here is my approach:
step 1: take all the time colum from my CSV file into an array,
step 2: split the data from the mid where :(colon) occurs, make two new arrays of date and time,
step 3: remove "[" from date array,
step 4: replace all forward slash into dashes in the date array,
step 5: and then append date and time array to make a single pandas format,
which will be looks like this, 2017-03-22 15:16:45
as you known that I am new and my approach is naive and also wrong, if someone can help me with providing me code snippet, I will be really happy, thanks
Upvotes: 1
Views: 377
Reputation: 18916
You can pass a format to pd.to_datetime()
, in this case: [%d/%b/%Y:%H:%M:%S
.
Be careful with erroneous data though as seen in row 3 in sample data below ([30/Apr/1998:21:32:3l8,671). To not get an error you can pass errors=coerce
, will return Not a Time (NaT).
The other way would be to replace those rows manually or write some sort of regex/replace funtion first.
import pandas as pd
data = '''\
time, Numbers
[30/Apr/1998:21:30:17,24736
[30/Apr/1998:21:30:53,24736
[30/Apr/1998:21:32:3l8,671
[30/Apr/1998:21:32:38,1512
[30/Apr/1998:21:32:38,1136
[30/Apr/1998:21:32:58,-
[30/Apr/1998:21:32:59,231'''
fileobj = pd.compat.StringIO(data)
df = pd.read_csv(fileobj, sep=',', na_values=['-'])
df['time'] = pd.to_datetime(df['time'], format='[%d/%b/%Y:%H:%M:%S', errors='coerce')
print(df)
Returns:
time Numbers
0 1998-04-30 21:30:17 24736.0
1 1998-04-30 21:30:53 24736.0
2 NaT 671.0
3 1998-04-30 21:32:38 1512.0
4 1998-04-30 21:32:38 1136.0
5 1998-04-30 21:32:58 NaN
6 1998-04-30 21:32:59 231.0
Note that: na_values=['-']
was used here to help pandas understand the Numbers column is actually numbers and not strings.
And now we can perform actions like grouping (on minute for instance):
print(df.groupby(df.time.dt.minute)['Numbers'].mean())
#time
#30.0 24736.000000
#32.0 959.666667
Upvotes: 2