Ben T

Reputation: 13

Problems parsing a text file (encoding?)

I'm trying to parse a text file (a Valve KeyValues language file) and I'm running into problems. I'm using this library to parse other KeyValues files and, as far as I can tell, it works perfectly, but for the language file it just returns an empty dict.
I tried some simple checks, like iterating over all the lines in the file and testing whether a given string occurs in them (I know the string is there just from looking at the file), but it is never found. Single characters do seem to work, though.
If I print the lines directly to the console, it looks as if there were a space between every character. I uploaded the file to my Google Drive here.

It is a language file, so I guess it could be stored in a different encoding, but I couldn't find anything via Google and I don't really know what to search for.

Upvotes: 1

Views: 373

Answers (2)

wildwilhelm

Reputation: 5019

Indeed, it seems like your file is encoded as UTF-16:

$ file ~/Downloads/dota_english.txt
~/Downloads/dota_english.txt: Little-endian UTF-16 Unicode C++ program text, with very long lines, with CRLF line terminators

This fits with your description of seeing "a space in-between every character": UTF-16 uses two bytes for most characters (four for those outside the Basic Multilingual Plane), so in ASCII text each character is stored as its ASCII byte followed by a null byte, and those null bytes show up as the gaps in your output.
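A small sketch of that effect, using only the Python standard library. The sample string is made up for illustration and is not taken from the actual file:

```python
# Hypothetical sample text, not taken from dota_english.txt
text = 'Dota'

# Encode as little-endian UTF-16 without a BOM: each ASCII character
# becomes its ASCII byte followed by a null byte
raw = text.encode('utf-16-le')
print(raw)

# Searching the undecoded bytes for the plain ASCII string fails,
# while a single character still matches
print(b'Dota' in raw)
print(b'D' in raw)

# Decoding with the matching codec recovers the original text
print(raw.decode('utf-16-le') == text)
```

This would also explain why searching for whole strings failed in the question while single characters still matched.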

You could try specifying the encoding while loading the file, for instance using the codecs module:

import codecs
import vdf

# Decode as UTF-16 (the BOM selects the byte order) and close the file when done
with codecs.open('dota_english.txt', 'r', encoding='utf-16') as f:
    d = vdf.load(f)
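If the `file` utility is not available, one rough way to check for UTF-16 is to look for a byte order mark (BOM) at the start of the file. The sketch below is only an illustration under stated assumptions: `sniff_utf16` is a hypothetical helper, and the temporary file with made-up contents stands in for the real language file.

```python
import codecs
import io
import os
import tempfile

def sniff_utf16(path):
    """Return 'utf-16' if the file starts with a UTF-16 BOM, else None.

    Hypothetical helper. Caveat: a UTF-32 little-endian BOM begins with
    the same two bytes, so this check is only a heuristic.
    """
    with open(path, 'rb') as f:
        head = f.read(2)
    if head in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE):
        return 'utf-16'
    return None

# Stand-in file with made-up contents; the real input would be dota_english.txt
fd, path = tempfile.mkstemp(suffix='.txt')
os.close(fd)
with io.open(path, 'w', encoding='utf-16') as f:
    f.write(u'"lang"\n{\n}\n')

# Detect the encoding, then read the file back as decoded text
encoding = sniff_utf16(path)
with io.open(path, 'r', encoding=encoding) as f:
    content = f.read()
os.remove(path)

print(encoding)
```

The file object opened with the detected encoding could then be passed to `vdf.load` in the same way as above.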

Upvotes: 1

Fame Castle

Reputation: 33

It looks like a kind of JSON file with XML in it. Can you upload your source code? There are many JSON parsers; you could use the built-in json module and xmllib.

Upvotes: 0
