Reputation: 6179
I am working on a tool that tries to detect the formatting of a CSV file and asks the end user whether the guess is correct. I started testing with timezones, and my test input looks like this:
"\r\n".join(
(
"timestamp;col1;col2",
"2020-01-22T00:14:47-04:00;1;6.1",
"2020-02-23T01:15:47-04:00;2;7.1",
"2020-02-24T01:15:47-04:00;3;8.1",
"2020-02-25T01:15:47-04:00;4;9.1",
"2020-02-26T01:15:47-04:00;5;0.",
)
).encode()
To figure out the dialect I do
csv.Sniffer().sniff(lookup_row, self.allowed_delimiters)
I load this file with
csv.reader(opened_csv, dialect=dialect)
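For reference, the two steps above can be reproduced in isolation. This is only a sketch: `sample` and `allowed_delimiters` are stand-ins for `lookup_row` and `self.allowed_delimiters`, the input is kept as `str` rather than encoded bytes, and `io.StringIO` replaces the opened file.

```python
import csv
import io

# Illustrative stand-ins; these names are not from the original code.
sample = "\r\n".join(
    (
        "timestamp;col1;col2",
        "2020-01-22T00:14:47-04:00;1;6.1",
        "2020-02-23T01:15:47-04:00;2;7.1",
        "2020-02-24T01:15:47-04:00;3;8.1",
        "2020-02-25T01:15:47-04:00;4;9.1",
        "2020-02-26T01:15:47-04:00;5;0.",
    )
)
allowed_delimiters = ";,"

# Guess the dialect, restricted to the allowed delimiters.
dialect = csv.Sniffer().sniff(sample, allowed_delimiters)

# Read the same text back using the sniffed dialect.
rows = list(csv.reader(io.StringIO(sample, newline=""), dialect=dialect))

first_timestamp = rows[1][0]
# Code point of the sign in front of the UTC offset (index 19).
sign_code = ord(first_timestamp[19])
print(dialect.delimiter, first_timestamp, sign_code)
```

Checking `sign_code` right after reading is a quick way to see which character actually arrives in the cell.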
Here's the fun part. When I copy-paste the first timestamp and parse it to datetime
from dateutil import parser
a = '2020-01-22T00:14:47-04:00'
found_val = parser.parse(a)
it properly returns the datetime. But when I run this input through sniff and csv.reader and iterate over the rows in my test, dateutil can't parse it
b = '2020-01-22T00:14:47−04:00' # <-- in my test case
and
a == b
>>> False
So when I looked closer
a_ord = [ord(char) for char in a] # [50, 48, 50, 48, 45, 48, 49, 45, 50, 50, 84, 48, 48, 58, 49, 52, 58, 52, 55, 45, 48, 52, 58, 48, 48]
b_ord = [ord(char) for char in b] # [50, 48, 50, 48, 45, 48, 49, 45, 50, 50, 84, 48, 48, 58, 49, 52, 58, 52, 55, 8722, 48, 52, 58, 48, 48]
The diff is the - sign in front of the timezone offset. Apparently a "raw" copy-paste results in a minus which is Unicode code point 45, while sniffer thinks (?) it's 8722.
My mind is blown, especially because the rest of the hyphens in this cell are considered to be 45.
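A quick way to see what the two code points are is to ask `unicodedata` for their names. This is a small sketch using the two strings from above, with the second one written with an explicit `\u2212` escape so the difference is visible in source.

```python
import unicodedata

a = "2020-01-22T00:14:47-04:00"
b = "2020-01-22T00:14:47\u221204:00"  # same text, but with U+2212 before the offset

# Collect every position where the two strings differ.
diffs = [(i, x, y) for i, (x, y) in enumerate(zip(a, b)) if x != y]

for i, x, y in diffs:
    print(i, ord(x), unicodedata.name(x), ord(y), unicodedata.name(y))
# prints: 19 45 HYPHEN-MINUS 8722 MINUS SIGN
```

So 45 is the ASCII HYPHEN-MINUS and 8722 (U+2212) is the typographic MINUS SIGN, which looks almost identical in most fonts.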
As it's a special case scenario and I care only about the proper parsing of this column, is replacing this character (if found) the best way to go about that?
Or can I somehow define in sniffer that's a wrong character/limit UNICODE scope?
Should it be considered a bug in dateutil?
Upvotes: 0
Views: 229
Reputation: 10799
Should it be considered a bug in dateutil?
I wouldn't say that. I don't know anything about dateutil, but I'd say you just aren't taking advantage of the features available to you.
Looking at the documentation for dateutil.parser.parse, it looks like you can pass an optional dateutil.parser.parserinfo object that describes what constitutes acceptable input.
Specifically, I think you'll want to look at dateutil.parser.parserinfo.JUMP, which seems to be a list of acceptable separators, which looks like this by default:
JUMP = [' ', '.', ',', ';', '-', '/', "'", 'at', 'on', 'and', 'ad', 'm', 't', 'of', 'st', 'nd', 'rd', 'th']
So, I'm guessing, all you have to do is pass in one of these parserinfo objects with a custom JUMP that includes your special hyphen.
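If the custom JUMP route doesn't pan out (as far as I can tell, JUMP entries are skipped rather than read as a sign, so it's worth checking that the -04:00 offset survives), the replacement idea from the question is easy to sketch. The helper name `normalize_minus` is made up for illustration, and `datetime.fromisoformat` stands in for `dateutil.parser.parse` only to keep the example standard-library-only.

```python
from datetime import datetime, timedelta

# Map the typographic MINUS SIGN (U+2212) to the ASCII hyphen-minus.
MINUS_FIX = str.maketrans({"\u2212": "-"})


def normalize_minus(text: str) -> str:
    """Hypothetical helper: replace U+2212 with '-' before parsing."""
    return text.translate(MINUS_FIX)


b = "2020-01-22T00:14:47\u221204:00"  # timestamp as it appears in the test case
cleaned = normalize_minus(b)

# Any ISO 8601 parser should accept the cleaned string now.
parsed = datetime.fromisoformat(cleaned)
print(parsed.isoformat(), parsed.utcoffset())
```

The same cleaned string can be handed to `parser.parse` instead; applying the translation only to the timestamp column keeps other cells untouched.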
Upvotes: 1