Ruxuan Ouyang

Reputation: 791

Importing foreign languages from csv file to Stata

I am using Stata 12 and importing a number of .csv files with the insheet command. The datasets may include Russian, Croatian, Turkish, etc., and I think they are encoded in UTF-8. In the .csv files the text is correct, but after I import them into Stata the original strings turn into strange characters. Could you please help me with that? Can Stat/Transfer solve the problem? Does it support the .csv format?

For example, the original file looks like this: [image]

My code is:

    insheet using name.csv, c n
    save name.dta, replace

The result looks like this: [image]

I have also tried changing the script in the font options, but that does not work.

Upvotes: 4

Views: 7067

Answers (2)

Alexis

Reputation: 854

Updated answer: As of version 14, all of Stata is Unicode-aware: results, help files, do-files, ado-files, data labels, etc.

This does not help users who only have access to versions of Stata before 14, but upgrading is one kind of solution. Using the OP's example:

. insheet using "/home/Alexis/Desktop/data.csv"
(3 vars, 4 obs)

. ed

. list

     +------------------------------------------------------------------------------+
     |         v1    v2                                                          v3 |
     |------------------------------------------------------------------------------|
  1. | RU00040778   RUS                                  ПРAЙCBOTEРXAУCKУПEРC AУДИT |
  2. | RU00044434   RUS                                                        КПMГ |
  3. | RU00044428   RUS                                               Эрнст энд Янг |
  4. | RU00044428   RUS   Аудиторско-консулбтационная группа Раэвитие Биэнес-систем |
     +------------------------------------------------------------------------------+

Upvotes: 2

I.M.

Reputation: 181

As @Nick Cox commented earlier, the problem is that Stata 12 just doesn't support Unicode/UTF-8 encoding. No, Stat/Transfer wouldn't resolve the problem (please refer to this explanation).

You can work around this using an online converter or MS Word. Let's do it with one language first, say Russian, as in your screenshots. You will need to look up the correct encodings for Croatian, Turkish, and the other languages you have.

  1. Save the string variable from your .csv file as plain text (.txt), choosing the UTF-8 encoding option.
  2. Encoding conversion:
    • Use iconv, suggested by @Dimitriy V. Masterov, or
    • Use an online tool, such as this one: upload the .txt file, choose UTF-8 as the source encoding and the output encoding that matches the language of interest (for Russian, it must be CP1251), click the "convert" button, and save the output file, or
    • If you have MS Office, you can also use MS Word for the same purpose. Right-click the .txt file, choose "Open with...", and open it with MS Word. In the dialog that appears, confirm that the file encoding is "Unicode (UTF-8)" and open the file. Then click "Save as..." and save as plain text. In the next dialog, choose "Cyrillic (Windows)", tick "Insert line breaks", and save.
  3. Check your new .txt file: it should still show some strange characters (like ÌßÑÎÊÎÌÁÈÍÀÒ), but now Stata can display them properly.
  4. Copy and paste the new string variable into the Stata Data Editor, right-click the variable, choose "Font...", and then select the script "Cyrillic". You should see the correct names on screen in both the Data Editor and the Results window (even though the stored string itself is unchanged).
    example of the CP1251 encoding in Stata
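
If you would rather script step 2 than click through a website or Word, here is a minimal Python sketch of the same conversion. It is not part of the original workflow: the function names and the language table are made up for illustration. CP1251 for Russian is the encoding used above; CP1250 for Croatian and CP1254 for Turkish are my assumption of the matching Windows code pages, so verify them against your own data.

```python
from pathlib import Path

# Hypothetical lookup table. Only the Russian entry comes from the steps above;
# the Croatian and Turkish code pages are assumptions to be checked.
CODEPAGES = {
    "russian": "cp1251",
    "croatian": "cp1250",
    "turkish": "cp1254",
}


def convert_file(src, dst, language):
    """Re-encode a UTF-8 text file into the single-byte code page for `language`."""
    codepage = CODEPAGES[language]
    # "utf-8-sig" decodes UTF-8 and drops a leading byte-order mark if there is one.
    text = Path(src).read_bytes().decode("utf-8-sig")
    # Raises UnicodeEncodeError if a character does not exist in the target code page.
    Path(dst).write_bytes(text.encode(codepage))


def as_seen_without_cyrillic_font(text):
    """Show how CP1251 bytes look when read as Western European (CP1252)."""
    return text.encode("cp1251").decode("cp1252")


if __name__ == "__main__":
    # The file names here are placeholders.
    convert_file("names_utf8.txt", "names_cp1251.txt", "russian")
    print(as_seen_without_cyrillic_font("МЯСОКОМБИНАТ"))
```

The second function only illustrates step 3: the converted file holds the right bytes, but they look like accented Latin letters until the Cyrillic script is selected in Stata.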

Depending on your OS, you might need to install all appropriate languages first.
Hope it helps.

Upvotes: 2
