Tom Wojcik

Reputation: 6179

How to prevent csv sniffer from thinking hyphen is UNICODE 8722 instead of 45?

I am working on a tool that tries to detect a CSV file's formatting and asks the end user whether the guess is correct. I started testing with timezones, and my test input looks like this:

"\r\n".join(
    (
        "timestamp;col1;col2",
        "2020-01-22T00:14:47-04:00;1;6.1",
        "2020-02-23T01:15:47-04:00;2;7.1",
        "2020-02-24T01:15:47-04:00;3;8.1",
        "2020-02-25T01:15:47-04:00;4;9.1",
        "2020-02-26T01:15:47-04:00;5;0.",
    )
).encode()

To figure out the dialect I do

csv.Sniffer().sniff(lookup_row, self.allowed_delimiters)

I load this file with

csv.reader(opened_csv, dialect=dialect)
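For reference, the two steps above can be reproduced in isolation with a sketch like the following. It assumes the input is handled as `str` rather than bytes, and `ALLOWED_DELIMITERS` and `read_rows` are hypothetical names standing in for `self.allowed_delimiters` and the loading code, which are not shown in the question.

```python
import csv
import io

# Sample mirroring the test input above, kept as str instead of bytes.
SAMPLE = "\r\n".join(
    (
        "timestamp;col1;col2",
        "2020-01-22T00:14:47-04:00;1;6.1",
        "2020-02-23T01:15:47-04:00;2;7.1",
        "2020-02-24T01:15:47-04:00;3;8.1",
        "2020-02-25T01:15:47-04:00;4;9.1",
        "2020-02-26T01:15:47-04:00;5;0.",
    )
)

# Assumption: the real allowed_delimiters value is not shown in the question.
ALLOWED_DELIMITERS = ";,"


def read_rows(text, delimiters=ALLOWED_DELIMITERS):
    """Sniff the dialect from the text, then parse the same text with it."""
    dialect = csv.Sniffer().sniff(text, delimiters)
    # newline="" leaves line endings untouched so the csv module handles them.
    return list(csv.reader(io.StringIO(text, newline=""), dialect=dialect))


rows = read_rows(SAMPLE)
first_cell = rows[1][0]

# Collect any characters outside ASCII that show up in the parsed cell.
non_ascii = [(ch, ord(ch)) for ch in first_cell if ord(ch) > 127]
print(first_cell, non_ascii)
```

Inspecting `non_ascii` on a parsed cell is one way to check whether an unexpected character is already present after this step or only enters somewhere else, such as in the test source itself.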

Here's the fun part. When I copy-paste the first timestamp and parse it to datetime

from dateutil import parser

a = '2020-01-22T00:14:47-04:00'
found_val = parser.parse(a)

it properly returns the datetime. But when I run this input through sniff and csv.reader and iterate over the rows in my test, dateutil can't parse it:

b = '2020-01-22T00:14:47−04:00'  # <-- in my test case

and

>>> a == b
False

So when I looked closer

a_ord = [ord(char) for char in a]  # [50, 48, 50, 48, 45, 48, 49, 45, 50, 50, 84, 48, 48, 58, 49, 52, 58, 52, 55, 45, 48, 52, 58, 48, 48]
b_ord = [ord(char) for char in b]  # [50, 48, 50, 48, 45, 48, 49, 45, 50, 50, 84, 48, 48, 58, 49, 52, 58, 52, 55, 8722, 48, 52, 58, 48, 48]

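The difference is the minus sign in front of the timezone offset. Apparently a "raw" copy-paste gives code point 45 (U+002D, HYPHEN-MINUS), while the value coming out of my test has code point 8722 (U+2212, MINUS SIGN), as if the sniffer thinks (?) that's what it is.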

My mind is blown, especially because the rest of the hyphens in this cell are considered to be 45.
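The mismatch can also be reported by character name rather than by raw code point. This is a small sketch using the standard `unicodedata` module; `describe_diff` is a hypothetical helper name, and it assumes both strings have the same length.

```python
import unicodedata

a = "2020-01-22T00:14:47-04:00"        # ASCII hyphen-minus before the offset
b = "2020-01-22T00:14:47\u221204:00"   # U+2212 before the offset


def describe_diff(left, right):
    """List positions where two equal-length strings differ, with character names."""
    return [
        (
            index,
            f"U+{ord(x):04X} {unicodedata.name(x)}",
            f"U+{ord(y):04X} {unicodedata.name(y)}",
        )
        for index, (x, y) in enumerate(zip(left, right))
        if x != y
    ]


print(describe_diff(a, b))
# [(19, 'U+002D HYPHEN-MINUS', 'U+2212 MINUS SIGN')]
```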

As this is a special case and I only care about parsing this column properly, is replacing this character (if found) the best way to go about it?
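The replacement approach could look like the sketch below. `normalize_minus` is a hypothetical helper, and the standard library's `datetime.fromisoformat` (Python 3.7+) stands in for `dateutil.parser.parse` only to keep the example dependency-free; the cleaned string would be passed to dateutil in the same way.

```python
from datetime import datetime

MINUS_SIGN = "\u2212"  # Unicode MINUS SIGN, code point 8722


def normalize_minus(value):
    """Replace every Unicode MINUS SIGN with the ASCII hyphen-minus (code point 45)."""
    return value.replace(MINUS_SIGN, "-")


raw = "2020-01-22T00:14:47\u221204:00"
cleaned = normalize_minus(raw)

# Stand-in parser for illustration; the question itself uses dateutil.
parsed = datetime.fromisoformat(cleaned)
print(cleaned, parsed.utcoffset())
```

Applying the replacement to each cell before parsing leaves strings that already use code point 45 unchanged.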

Or can I somehow tell the sniffer that this is a wrong character, or limit the Unicode range it accepts?

Should it be considered a bug in dateutil?

Upvotes: 0

Views: 229

Answers (1)

Paul M.

Reputation: 10799

Should it be considered a bug in dateutil?

I wouldn't say that. I don't know anything about dateutil, but I'd say you just aren't taking advantage of the features available to you.

Looking at the documentation for dateutil.parser.parse, it looks like you can pass an optional dateutil.parser.parserinfo object that describes what constitutes acceptable input.

Specifically, I think you'll want to look at dateutil.parser.parserinfo.JUMP, which seems to be a list of acceptable separators. By default it looks like this:

JUMP = [' ', '.', ',', ';', '-', '/', "'", 'at', 'on', 'and', 'ad', 'm', 't', 'of', 'st', 'nd', 'rd', 'th']

So, I'm guessing, all you have to do is pass in one of these parserinfo objects with a custom JUMP that includes your special hyphen.

Upvotes: 1
