Dragonfly

Reputation: 804

Determining the encoding of a file uploaded to Google App Engine

I have a website based on GAE and Python, and I'd like the user to be able to upload a text file for processing. My implementation is based on the standard code from the docs (see http://code.google.com/appengine/docs/python/blobstore/overview.html), and my text-file upload handler essentially looks like this:

from google.appengine.ext import blobstore
from google.appengine.ext.webapp import blobstore_handlers

class Uploader(blobstore_handlers.BlobstoreUploadHandler):
    def post(self):
        # The uploaded file lands in the blobstore; read it back line by line.
        upload_files = self.get_uploads('file')
        blob_info = upload_files[0]
        blob_reader = blobstore.BlobReader(blob_info.key())
        for line in blob_reader:
            line = line.rstrip().decode('cp1252')
            do_something(line)
        blob_reader.close()

This works fine for a text file encoded with Code Page 1252, which is what you get when using Windows Notepad and saving with what it calls an "ANSI" encoding. But if you use this handler with a file that has been saved with Notepad's UTF-8 encoding, and contains, say, some Cyrillic characters or a u-umlaut, you'll end up with gibberish. For such a file, changing decode('cp1252') to decode('utf_8') will do the trick. (Well, there's also the possibility of a byte order mark (BOM) at the beginning, but that's easily stripped away.)
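To illustrate the gibberish, here's a minimal round trip (just a demonstration, not part of my handler):

text = u'f\xfcr'                     # u'für'
data = text.encode('utf_8')          # '\x66\xc3\xbc\x72'
print repr(data.decode('cp1252'))    # u'f\xc3\xbcr', renders as 'fÃ¼r'
print repr(data.decode('utf_8'))     # u'f\xfcr', renders as 'für'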

But how do you know which decoding to use? The BOM isn't guaranteed to be there, and I don't see any other way to know, other than to ask the user—who probably doesn't know either. Is there a reliable method for determining the encoding? I don't necessarily have to use the blobstore if some other means solves it.

And then there's the encoding that Windows Notepad calls "Unicode", which is actually UTF-16 little endian. I could find no codec (including "utf_16_le") that correctly decodes a file saved with this encoding. Can one of these files be read?
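For reference, the standard codecs module exposes the BOM byte sequences, so the best I can think of short of asking the user is a rough BOM sniff (sniff_encoding is just a hypothetical helper here, and falling back to cp1252 is only a guess):

import codecs

def sniff_encoding(data, default='cp1252'):
    # Check for the BOMs that Windows Notepad writes.
    if data.startswith(codecs.BOM_UTF8):
        return 'utf_8_sig'  # decodes the text and discards the BOM
    if data.startswith(codecs.BOM_UTF16_LE) or data.startswith(codecs.BOM_UTF16_BE):
        return 'utf_16'     # the BOM itself selects the byte order
    return default          # assumed fallback for BOM-less "ANSI" files

Note that the plain "utf_16" codec consumes the BOM and picks the byte order from it, so it may handle Notepad's "Unicode" files when the whole file is decoded in one go, but that still leaves BOM-less files unidentified.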

Upvotes: 1

Views: 1202

Answers (2)

Dragonfly

Reputation: 804

Following the response from demalexx, my upload handler now determines the encoding using chardet (http://pypi.python.org/pypi/chardet), which, from what I can tell, works extremely well. Along the way I've discovered that using "for line in blob_reader" to read uploaded text files is extremely troublesome. Instead, if you don't mind reading your entire file in one gulp, the solution is easy. (Note the stripping away of one BOM sequence, and the splitting of lines across CR/LF.)

import chardet
from google.appengine.ext import blobstore
from google.appengine.ext.webapp import blobstore_handlers

class Uploader(blobstore_handlers.BlobstoreUploadHandler):
    def post(self):
        upload_files = self.get_uploads('file')
        blob_info = upload_files[0]
        # Read the whole blob, let chardet guess the encoding, then decode.
        text = blobstore.BlobReader(blob_info.key()).read()
        encoding = chardet.detect(text)['encoding']
        if encoding is not None:
            # Strip a leading BOM (u'\ufeff') and split on CR/LF pairs.
            for line in text.decode(encoding).lstrip(u'\ufeff').split(u'\x0d\x0a'):
                do_something(line)

If you want to read piecemeal from your uploaded file, you're in for a world of pain. The problem is that "for line in blob_reader" apparently reads up to where a line-feed (\x0a) byte is found, which is disastrous when reading a utf_16_le encoded file as it chops a \x0a\x00 sequence in half!
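Here's a quick demonstration of the chopping (outside any handler):

data = u'a\r\nb'.encode('utf_16_le')
print repr(data)                # 'a\x00\r\x00\n\x00b\x00'
print repr(data.split('\x0a'))  # ['a\x00\r\x00', '\x00b\x00']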

I don't recommend it, but here's an upload handler that will successfully process files saved with all four of the encodings Windows 7 Notepad offers (namely, ANSI, UTF-8, Unicode and Unicode big endian), one line at a time. As you can see, stripping away the line-termination sequences is cumbersome.

import chardet
from google.appengine.ext import blobstore
from google.appengine.ext.webapp import blobstore_handlers

class Uploader(blobstore_handlers.BlobstoreUploadHandler):
    def post(self):
        upload_files = self.get_uploads('file')
        blob_info = upload_files[0]
        blob_reader = blobstore.BlobReader(blob_info.key())
        # First pass: sample the start of the blob for chardet, then rewind.
        encoding = chardet.detect(blob_reader.read(10000))['encoding']
        if encoding is not None:
            blob_reader.seek(0)
            for line in blob_reader:
                # Strip a UTF-16 (2-byte) or UTF-8 (3-byte) BOM if present.
                if line[:2] in ['\xff\xfe', '\xfe\xff']:
                    start = 2
                elif line[:3] == '\xef\xbb\xbf':
                    start = 3
                else:
                    start = 0
                if encoding == 'UTF-16BE':
                    # Big endian: CR/LF arrives intact as \x00\x0d\x00\x0a.
                    if line[-4:] == '\x00\x0d\x00\x0a':
                        line = line[start:-4]
                    elif start > 0:
                        line = line[start:]
                elif encoding == 'UTF-16LE':
                    # Little endian: line iteration splits the \x0a\x00
                    # newline, leaving a stray \x00 heading the next line.
                    if line[start] == '\x00':
                        start += 1
                    if line[-3:] == '\x0d\x00\x0a':
                        line = line[start:-3]
                    elif start > 0:
                        line = line[start:]
                elif line[-2:] == '\x0d\x0a':
                    # Single-byte encodings: plain CR/LF termination.
                    line = line[start:-2]
                elif start > 0:
                    line = line[start:]
                do_something(line.decode(encoding))

This is undoubtedly brittle, and my tests have been restricted to those four encodings, and only for how Windows 7 Notepad creates files. Note that before reading a line at a time I'm grabbing up to 10000 bytes for chardet to analyze. That's only a guess as to how many it might need. This clumsy double-read is another reason to avoid this solution.
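If the 10000-byte guess bothers you, chardet also ships an incremental UniversalDetector that can be fed chunks until it has seen enough to decide. A sketch of that alternative (detect_blob_encoding is my own hypothetical helper, and the 4096-byte chunk size is arbitrary):

from chardet.universaldetector import UniversalDetector
from google.appengine.ext import blobstore

def detect_blob_encoding(blob_key, chunk_size=4096):
    # Hypothetical helper: feed the blob chunk by chunk and stop as
    # soon as the detector is confident.
    detector = UniversalDetector()
    reader = blobstore.BlobReader(blob_key)
    chunk = reader.read(chunk_size)
    while chunk and not detector.done:
        detector.feed(chunk)
        chunk = reader.read(chunk_size)
    detector.close()
    return detector.result['encoding']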

Upvotes: 1

demalexx

Reputation: 4761

Maybe this will help: Python: Is there a way to determine the encoding of text file?.

Upvotes: 3
