Olivier Ma

Reputation: 1309

Chinese characters all become gibberish when using pandas read_stata() function

I'm trying to read a Stata .dta file with the Python pandas package, using the read_stata() function. The .dta file contains many Chinese characters, but after reading it in, the Chinese characters all come out as gibberish. Any suggestions?

Upvotes: 3

Views: 1654

Answers (1)

Martijn Pieters

Reputation: 1121744

You'll need to specify the codec to use; by default, the text is decoded as ISO-8859-1 (Latin-1):

pandas.read_stata(filename, encoding=codec_to_use)
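
To illustrate why a Latin-1 default produces gibberish, here is a minimal sketch that does not touch pandas at all. It assumes, purely for illustration, that the text was stored as GBK; the helper name `fix_mojibake` and the sample string are hypothetical. Because Latin-1 maps every byte to exactly one character, a wrong Latin-1 decode can be undone by encoding back to bytes and decoding again with the right codec:

```python
def fix_mojibake(garbled, actual_codec="gb18030", wrong_codec="latin-1"):
    """Undo a wrong decode (hypothetical helper, not part of pandas).

    Latin-1 maps each byte to one character, so encoding the garbled
    text back with it recovers the original bytes without loss.
    """
    return garbled.encode(wrong_codec).decode(actual_codec)


original = "中文"
raw = original.encode("gbk")      # assumption: bytes as stored in the file
garbled = raw.decode("latin-1")   # what a Latin-1 default would produce
repaired = fix_mojibake(garbled)

print(repr(garbled), "->", repaired)
```

The same trick could be applied to string columns that were already read in with the wrong codec, provided the wrong codec really was Latin-1.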

See the pandas.read_stata() documentation:

encoding: string, None or encoding
Encoding used to parse the files. Note that Stata doesn’t support unicode. None defaults to iso-8859-1.

For Chinese, I'd guess that the codec used is either a gb* codec (gb18030, gbk, gb2312) or a UTF codec (UTF-8, UTF-16, or UTF-32). Despite the remark in the pandas documentation above, Stata 14 now supports Unicode and uses UTF-8 for it.
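
If you are unsure which of these codecs applies, one rough way to narrow it down is to take a sample of the raw bytes and see which candidates decode without an error. This is only a sketch: the function name `plausible_codecs` and the candidate list are my own choices, and the sample bytes are made up for the demonstration.

```python
def plausible_codecs(raw, candidates=("utf-8", "gb18030", "gbk", "gb2312")):
    """Return the candidate codecs that decode `raw` without raising.

    Hypothetical helper: a clean decode does not prove the codec is
    right, it only rules out the ones that fail.
    """
    ok = []
    for codec in candidates:
        try:
            raw.decode(codec)  # strict error handling by default
        except UnicodeDecodeError:
            continue
        ok.append(codec)
    return ok


gbk_bytes = "中文".encode("gbk")     # stand-in for bytes from a GBK file
utf8_bytes = "中文".encode("utf-8")  # stand-in for bytes from a UTF-8 file

print(plausible_codecs(gbk_bytes))
print(plausible_codecs(utf8_bytes))
```

Treat the result as a shortlist rather than an answer. Several codecs can accept the same bytes and still produce the wrong characters, so check the decoded text by eye before settling on one.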

Also see the Standard Encodings page for an overview of supported codecs.

Upvotes: 3
