Reputation: 1309
I'm trying to read a Stata .dta file with the Python pandas package, using the read_stata() function. The .dta file contains many Chinese characters, but after reading it in, the Chinese text comes out as gibberish. Any suggestions?
Upvotes: 3
Views: 1654
Reputation: 1121744
You'll need to specify a codec to use; the default is to decode your text as ISO-8859-1 (Latin-1):
pandas.read_stata(filename, encoding=codec_to_use)
See the pandas.read_stata() documentation:
encoding: string, None or encoding
Encoding used to parse the files. Note that Stata doesn’t support unicode. None defaults to iso-8859-1.
For Chinese, I'd guess that the codec used is either a gb* codec (gb18030, gbk, gb2312) or a UTF codec (UTF-8, UTF-16, or UTF-32). In spite of the remark in the pandas documentation above, I see that Stata 14 supports Unicode now, and that it uses UTF-8 for that.
Also see the Standard Encodings page for an overview of supported codecs.
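If the data has already been read with the Latin-1 default, the gibberish can usually be undone, because Latin-1 maps every byte to exactly one character and so loses nothing. Here is a minimal sketch of that idea; the helper name redecode is made up for illustration, and gb18030 is only my guess at the real codec (it is a superset of gbk and gb2312):

```python
def redecode(text, assumed="latin-1", actual="gb18030"):
    """Undo a wrong decode.

    Encode the garbled text back to the original bytes using the codec
    that was wrongly assumed, then decode those bytes with the codec
    that was (hopefully) really used. 'actual' is a guess, not a given.
    """
    return text.encode(assumed).decode(actual)


# Simulate the problem: GBK-encoded bytes decoded as Latin-1.
original = "中文数据"
garbled = original.encode("gbk").decode("latin-1")

fixed = redecode(garbled)
print(garbled != original)  # the Latin-1 reading is gibberish
print(fixed == original)    # the round trip recovers the text
```

If the guess is wrong, the decode step will typically raise a UnicodeDecodeError or produce different gibberish, in which case try the next candidate codec from the list above. On a DataFrame, the same helper could be applied to the string columns, for example with something like df[col].map(redecode), assuming the column holds only strings.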
Upvotes: 3