How to prevent csv sniffer from thinking hyphen is UNICODE 8722 instead of 45?

Question

I am working on a tool that tries to detect a CSV file's formatting and asks the end user to confirm it. I started testing with timezone-aware timestamps, and my test input looks like this:

"\n".join(
    (
        "timestamp;col1;col2",
        "2020-01-22T00:14:47-04:00;1;6.1",
        "2020-02-23T01:15:47-04:00;2;7.1",
        "2020-02-24T01:15:47-04:00;3;8.1",
        "2020-02-25T01:15:47-04:00;4;9.1",
        "2020-02-26T01:15:47-04:00;5;0.",
    )
).encode()

To figure out the dialect I do

csv.Sniffer().sniff(lookup_row, self.allowed_delimiters)

I load this file with

csv.reader(opened_csv, dialect=dialect)

Here's the fun part. When I copy-paste the first timestamp and parse it to datetime

from dateutil import parser

a = '2020-01-22T00:14:47-04:00'
found_val = parser.parse(a)

it properly returns the datetime. But when I run this input through sniff and csv.reader and iterate over the rows in my test, dateutil can't parse it:

b = '2020-01-22T00:14:47−04:00'  # <-- in my test case

and

a == b
>>> False

So when I looked closer

a_ord = [ord(char) for char in a]  # [50, 48, 50, 48, 45, 48, 49, 45, 50, 50, 84, 48, 48, 58, 49, 52, 58, 52, 55, 45, 48, 52, 58, 48, 48]
b_ord = [ord(char) for char in b]  # [50, 48, 50, 48, 45, 48, 49, 45, 50, 50, 84, 48, 48, 58, 49, 52, 58, 52, 55, 8722, 48, 52, 58, 48, 48]
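
To make the two code points easier to tell apart than a raw list of integers, the official Unicode names can be looked up with the standard `unicodedata` module. This is only a small diagnostic sketch; the helper name `describe_differences` is made up for illustration, and `b` is written with an explicit `\u2212` escape so the character cannot be silently changed by an editor or clipboard.

```python
import unicodedata

a = "2020-01-22T00:14:47-04:00"       # typed by hand: hyphen-minus everywhere
b = "2020-01-22T00:14:47\u221204:00"  # value seen in the test: U+2212 before the offset


def describe_differences(left, right):
    """Return (index, name_in_left, name_in_right) for every differing position."""
    return [
        (index, unicodedata.name(l_char), unicodedata.name(r_char))
        for index, (l_char, r_char) in enumerate(zip(left, right))
        if l_char != r_char
    ]


diffs = describe_differences(a, b)
for index, left_name, right_name in diffs:
    print(index, left_name, "vs", right_name)
```

Only one position differs, which matches the single 45 vs 8722 mismatch in the lists above.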

The difference is the minus sign before the timezone offset. Apparently a "raw" copy-paste gives code point 45 (U+002D HYPHEN-MINUS), while the sniffer thinks (?) it's 8722 (U+2212 MINUS SIGN).

My mind is blown, especially because the rest of the hyphens in this cell are considered to be 45.

Since this is a special case and I only care about parsing this column correctly, is replacing this character (when found) the best way to handle it?
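
A minimal sketch of the replacement approach, assuming only U+2212 needs to be mapped to the ASCII hyphen-minus. The helper names `normalize_minus` and `parse_timestamp` are hypothetical. It uses the standard library's `datetime.fromisoformat` so that it runs without extra dependencies; `dateutil.parser.parse` could presumably be called on the normalized string in the same way.

```python
from datetime import datetime, timedelta

MINUS_SIGN = "\u2212"  # code point 8722
HYPHEN_MINUS = "-"     # code point 45


def normalize_minus(value: str) -> str:
    """Replace every Unicode MINUS SIGN with the ASCII hyphen-minus."""
    return value.replace(MINUS_SIGN, HYPHEN_MINUS)


def parse_timestamp(value: str) -> datetime:
    """Normalize the sign character, then parse the ISO 8601 timestamp."""
    return datetime.fromisoformat(normalize_minus(value))


parsed = parse_timestamp("2020-01-22T00:14:47\u221204:00")
print(parsed.isoformat())
```

Normalizing only the timestamp column, rather than the whole file, keeps any legitimate U+2212 characters in other columns untouched.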

Or can I somehow tell the sniffer that this is a wrong character, or limit the Unicode range it accepts?
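
One way to check where the character comes from is a round trip: feed a known string through `csv.Sniffer().sniff` and `csv.reader` and compare the first cell with what went in. The sketch below is an assumption-laden diagnostic, not the original test: the helper `round_trip_first_cell` and its three-row sample are invented for illustration, and the default delimiter set `";,"` stands in for `self.allowed_delimiters`.

```python
import csv
import io


def round_trip_first_cell(timestamp: str, delimiters: str = ";,") -> str:
    """Sniff a small sample, read it back, and return the first data cell."""
    text = "\n".join(
        (
            "timestamp;col1;col2",
            f"{timestamp};1;6.1",
            f"{timestamp};2;7.1",
        )
    )
    dialect = csv.Sniffer().sniff(text, delimiters)
    rows = list(csv.reader(io.StringIO(text), dialect=dialect))
    return rows[1][0]


ascii_in = "2020-01-22T00:14:47-04:00"
unicode_in = "2020-01-22T00:14:47\u221204:00"

print(round_trip_first_cell(ascii_in) == ascii_in)
print(round_trip_first_cell(unicode_in) == unicode_in)
```

If both comparisons hold, the cell comes back exactly as it went in, which would suggest looking at the test input itself (for example, text pasted from a web page or rich-text editor) rather than at the sniffer.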

Should it be considered a bug in dateutil?
