Chinese characters all become gibberish when using pandas read_stata() function

Question

I'm trying to read a Stata .dta file with the Python pandas package, using the read_stata() function. The .dta file contains many Chinese characters, but after reading it in, the text is garbled and the Chinese characters all come out as gibberish. Any suggestions?

Martijn Pieters · Accepted Answer

You'll need to specify the codec to use; by default, your text is decoded as ISO-8859-1 (Latin-1):

pandas.read_stata(filename, encoding=codec_to_use)

See the pandas.read_stata() documentation:

encoding: string, None or encoding
Encoding used to parse the files. Note that Stata doesn’t support unicode. None defaults to iso-8859-1.

For Chinese, I'd guess that the codec used is either a gb* codec (gb18030, gbk, gb2312) or a UTF codec (UTF-8, UTF-16, or UTF-32). In spite of the remark in the pandas documentation above, I see that Stata 14 supports Unicode now, and that it uses UTF-8 for that.

Also see the Standard Encodings page for an overview of supported codecs.
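
To see why the wrong codec produces gibberish, and how you might narrow down the right one, here is a minimal sketch using only the standard library. It assumes the strings were decoded as Latin-1 (the default described above); the sample text and the try_codecs helper are made up for illustration and are not part of pandas.

```python
# Hypothetical sample text standing in for a string stored in the .dta file.
original = "中文"

# Assume the file stored the text as gb18030 bytes.
raw = original.encode("gb18030")

# Latin-1 maps every byte to a character, so decoding never fails;
# it silently yields gibberish instead of raising an error.
garbled = raw.decode("latin-1")


def try_codecs(raw_bytes, candidates=("utf-8", "gb18030")):
    """Decode raw_bytes with each candidate codec; None marks a failed decode."""
    results = {}
    for codec in candidates:
        try:
            results[codec] = raw_bytes.decode(codec)
        except UnicodeDecodeError:
            results[codec] = None
    return results


# Encoding the gibberish back to Latin-1 restores the original bytes,
# which can then be tried against each candidate codec.
recovered = try_codecs(garbled.encode("latin-1"))

for codec, text in recovered.items():
    print(codec, repr(text))
```

A codec that decodes without error is not necessarily the right one, so check the output by eye. Once a candidate produces readable Chinese, that is the codec to pass as the encoding argument.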
