Problems parsing a text file (encoding?)

Question

I'm trying to parse a text file (a Valve-KeyValues language file) and I'm encountering some problems. I'm using this library to parse other KeyValues files and it works perfectly from what I could gather, but for the language file it just returns an empty dict.
I tried some simple things like iterating over all the lines in the file and checking if a string exists (I know that the string exists just by looking at the file) and it never finds it. Single characters seem to work though.
If I print the lines directly to the console, it looks as if there were a space between every character. I uploaded the file to my Google Drive here.

It is a language file, so I guess it could be stored in a different encoding, but I couldn't find anything via Google and I don't really know what to search for here.

wildwilhelm · Accepted Answer

Indeed, it seems like your file is encoded as UTF-16:

$ file ~/Downloads/dota_english.txt
~/Downloads/dota_english.txt: Little-endian UTF-16 Unicode C++ program text, with very long lines, with CRLF line terminators

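This fits with your description of seeing "a space in-between every character". UTF-16 stores text in two-byte code units, so in a little-endian file each ASCII character is written as its ASCII byte followed by a null byte. When the file is read as a single-byte encoding, those null bytes show up as the "spaces" in the text, and they are also why searching for a whole string fails while single characters are still found.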
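To make the effect concrete, here is a small sketch (Python 3 assumed; the sample string is made up and not taken from the file) that encodes ASCII text as little-endian UTF-16 and then misreads the bytes as a single-byte encoding:

```python
# Encode a short ASCII string as little-endian UTF-16 (sample text is illustrative only).
text = "DOTA"
raw = text.encode("utf-16-le")
print(raw)  # b'D\x00O\x00T\x00A\x00'

# Misread the same bytes as a single-byte encoding: every other character is a null.
misread = raw.decode("latin-1")

# The whole word is no longer found, but individual characters still are.
print("DOTA" in misread)  # False
print("D" in misread)     # True

# Decoding with the right codec recovers the original text.
print(raw.decode("utf-16-le") == text)  # True
```

This matches the symptoms in the question: substring checks fail, single characters match, and printed lines look spaced out.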

You could try specifying the encoding while loading the file, for instance using the codecs module:

import codecs
import vdf

# Decode the file as UTF-16 (the BOM selects the byte order) before parsing.
with codecs.open('dota_english.txt', 'r', encoding='utf-16') as f:
    d = vdf.load(f)
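If the `file` utility is not available, one rough way to guess the encoding is to look for a byte-order mark (BOM) at the start of the file. The sketch below is only an illustration: `guess_encoding_from_bom` and `read_text` are hypothetical helper names, not part of `vdf` or the standard library, and the fallback to UTF-8 is an assumption. Files without a BOM cannot be identified this way.

```python
import codecs
import io


def guess_encoding_from_bom(head, default="utf-8"):
    """Return a codec name based on a BOM in the first bytes of a file.

    Hypothetical helper; falls back to `default` when no BOM is present.
    """
    # Check UTF-32 first: its little-endian BOM begins with the UTF-16 LE BOM.
    if head.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return "utf-32"
    if head.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if head.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    return default


def read_text(path):
    """Read a text file, choosing the codec from its BOM (hypothetical helper)."""
    with io.open(path, "rb") as f:
        head = f.read(4)  # four bytes are enough for every BOM checked above
    with io.open(path, "r", encoding=guess_encoding_from_bom(head)) as f:
        return f.read()
```

The name returned by `guess_encoding_from_bom` could also be passed as the `encoding` argument in the `codecs.open` call above instead of hard-coding `'utf-16'`.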
