Ilian Iliev

Reputation: 3236

Python and parsing unicode files

A few weeks ago I wrote a CSV parser in Python, and it worked great with the text file I was given. But when we tested it with other files, the problems started.

First was the

ValueError: empty string for float()

for a string like "313.44". The problem was that the file was Unicode-encoded, so there were null bytes ('\x00') between the digits.
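To illustrate where those null bytes come from, here is a minimal sketch (not the original parser). It assumes the file was UTF-16 little-endian without a BOM, which is a guess based on the symptoms:

```python
# Illustration only: what "313.44" looks like as raw UTF-16 LE bytes.
raw = u"313.44".encode("utf-16-le")

# Each ASCII character is followed by a null byte, so reading the file
# as plain single-byte text leaves '\x00' between the digits.
has_nulls = b"\x00" in raw

# Decoding with the right codec removes them, and float() works again.
value = float(raw.decode("utf-16-le"))
```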

OK, so I decided to read the file as Unicode with

codecs.open(filename, 'r', 'utf-16')

And then all hell broke loose: a missing BOM, problems with line-ending characters (LF vs CR+LF), etc.

So can you give me a hint or a workaround for parsing both Unicode and non-Unicode files when I do not know what the encoding is, whether a BOM is present, what the line endings are, etc.?

P.S. I am using Python 2.7

Upvotes: 1

Views: 1614

Answers (2)

Ilian Iliev

Reputation: 3236

The problem was solved by using the csv module, as proposed by Daenyth.
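For readers with the same problem, one possible way to combine BOM detection with the csv module is sketched below. It is not the code actually used here: the helper names `decode_bytes` and `parse_csv_bytes` are hypothetical, the null-byte check for BOM-less UTF-16 is only a heuristic that assumes mostly ASCII content, and the sketch targets Python 3, where csv accepts text. On Python 2.7 the csv module expects byte strings, so the decoded text would have to be re-encoded (for example to UTF-8) before parsing.

```python
import codecs
import csv
import io

# Longer BOMs come first so UTF-32 LE is not mistaken for UTF-16 LE.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def decode_bytes(data, fallback="utf-8"):
    """Hypothetical helper: decode raw bytes, using the BOM if there is one."""
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            return data[len(bom):].decode(encoding)
    # No BOM. Heuristic (assumption): null bytes near the start suggest
    # UTF-16. For ASCII-heavy text the null is the second byte in LE
    # and the first byte in BE.
    if b"\x00" in data[:64]:
        encoding = "utf-16-le" if data[1:2] == b"\x00" else "utf-16-be"
        return data.decode(encoding)
    return data.decode(fallback)


def parse_csv_bytes(data):
    """Hypothetical helper: decode the bytes, then let csv split the rows."""
    text = decode_bytes(data)
    # newline="" leaves line endings untouched so csv.reader can handle
    # both LF and CR+LF itself.
    return list(csv.reader(io.StringIO(text, newline="")))
```

Reading the file in binary mode (`open(filename, "rb").read()`) and passing the bytes to `parse_csv_bytes` keeps the encoding decision in one place. If the heuristic is too crude for your data, a dedicated detection library may be a better fit.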

Upvotes: 1

Moss

Reputation: 6012

It mainly depends on the Python version you are using, but these two links should help you out:

Upvotes: 0
