Reputation: 8636
I am trying to read a csv file with pandas. the file is very long and malformed in the middle like so
Date,Received Date,Tkr,Theta,Wid,Per
2007-08-03,2017/02/13 05:30:G,F,B,A,1
2007-08-06,2017/02/13 05:30:G,F,A,B,1
2007-08-07,2017/02/13 05:30:G,F,A,B,1
2007-08-,nan,,,,
2000-05-30 00:00:00,2017/02/14 05:30:F,D,10,1,1
2000-05-31 00:00:00,2017/02/14 05:30:F,D,10,1,1
My line which is failing is this:
full_frame = pd.read_csv(path, parse_dates=["Date"],error_bad_lines=False).set_index("Date").sort_index()[:date]
with the error
TypeError: unorderable types: str() > datetime.datetime()
File "/A/B/C.py", line 236, in load_ex
full_frame = pd.read_csv(path, parse_dates=["Date"],error_bad_lines=False).set_index("Date").sort_index()[:date]
date
is just a variable that holds a given input date.
This is happening because of the broken line in the middle. I have tried to do
error_bad_line=False
but that wont prevent my script from failing.
When i take out the bad line from my csv and run it, it works fine. This csv will be used as an input and I cant modify it at source so I was wondering if there is a way to skip a line based on length of the line in the csv in pandas or something else I can do to make it work without duplicating/modifyng the file
UPDATE
The bad line is stored in my data frame if i simply do a
read_csv
as 2007-08- NaN NaN NaN NaN NaN
UPDATE 2:
if i try to just do
full_frame = pd.read_csv(path, parse_dates=["Date"],error_bad_lines=False)
full_frame = full_frame.dropna(how="any")
# this drops the NaN row for sure
full_frame = full_frame.set_index("Date").sort_index()[:date]
still gives same error :(
Upvotes: 1
Views: 1346
Reputation: 2567
So I gave this a quick shot. Your data has inconsistencies which should may be of concern to you for your analysis, and you should investigate. Analysis is only as good as that data quality is.
Here's some code (not the best, but gets the job mostly done)
First, since your data needs some work, I read it in as raw text. Then I write a function to parse the dates. I collect the columns in one list, and the rest of the data in another.
For all the data that needs to have dates, I loop over the data 1 line at a time and pass it through parse_dates.
parse_dates
works by reading in a list, grabbing the first item in the list (the date part) then trying to convert it from a simple string to a date. Since not all are datetime, I only grab the first 10 bytes for just dates.
Once I have a cleaner data, I pass it through pandas
and obtain a dataframe. Then I set the date to the index. This could be improved upon but given that this is not my job, I'll let you do the rest.
import pandas as pd
import datetime as dt
rawdata = []
with open("test.dat", "r") as stuff:
for line in stuff:
line1 = line[:-1]
rawdata.append(line1.split(","))
def parse_dates(line):
datepart = line[0][:10] ## get the date-time, and for the date-time, only get the date part
## since not all rows have date + time, cut it down to date
try:
result = dt.datetime.strptime(datepart, "%Y-%m-%d") ## try converting to date
except ValueError:
result = None
line[0] = result ## update
return line
cols = rawdata[0]
data = rawdata[1:]
print data
data = [parse_dates(line) for line in data]
print data
df = pd.DataFrame(data = data, columns = cols)
print df
df.index = df['Date']
Also, a simple Google search shows plenty of ways of handling dates with Python+pandas. Here is one link I found: https://chrisalbon.com/python/strings_to_datetime.html
Upvotes: 2