florian
florian

Reputation: 696

Stata: Correctly import massive delimited text file (from R) with multi line content and row numbers

I fail to import a dataset which I exported from R into Stata 14.1. It contains comments that users wrote to each other. It has the following structure:

    "project_id" "project_title" "project_owner_user_id" "comment_user_user_id" "comment_date" "comment_censored" "censored_reason" "comment_content"
"1" 36 "Vending Machine" "u1" "u1" "2011-06-24 04:03:57" 0 "" "SNOTty vending machine I built for a display model."
"2" 36 "Vending Machine" "u1" "u1" "2011-07-06 01:41:30" 0 "" " Can't wait to share it!  Some technical difficulties finding/getting to the image online.  Stay tuned."
"3" 41 "Animated Stock Car" "u2" "u2" "2011-07-12 00:25:40" 0 "" "Stockcars were an American invention, introduced in the mid 1800s as a way to transport livestock of various kinds. Thousands were used in great blocks by various western railroads to collect cattle from railheads in frontier towns such as Amarillo and Dodge City, and transport them to the great meat packing cities of Chicago and Kansas City, where the slaughterhouses awaited. 

They were also used for horse transport, as this one is."
"4" 40 "Pacific Electric Hollywood Car Trolley" "u3" "u2" "2011-07-12 00:51:29" 0 "" "What kind of trucks did you have in mind for this? My streetcar model uses standard 9V trucks"
"5" 43 "WeCan - 6 times expansion of WeDo -" "u4" "u4" "2011-07-12 23:59:47" 0 "" "Expansion of present WeDo, it is WeCan.
we can use 12 ports of interaction ports."
"6" 44 "WeMust - pneumatic step from WeDo -" "u4" "u4" "2011-07-13 00:07:09" 0 "" "Automatic pneumatic cylinder controller using USB port."

For some reason Stata fails to detect the correct number of variables (in the first line) and adds additional variables: enter image description here

Additionally the contents of the variables are all shifted to the right, because Stata fails to recognize the line numbers at the beginning of the lines. If i manually edit the text file and add an additional variable "line_number" in the first line of the file, this issue is resolved.

To make things worse the file has 652569 observations and contains entries in several languages including Mandarin etc.

The file is UTF-8 encoded. How can I correctly import this file in Stata?

In contrastR does a great job and imports it correctly with a simple oneliner:

X <- read.table ("x.txt")

However my collaborator wants to use Stata. Thanks for your help!

PS: Link to the textfile for reproduction: http://www25.zippyshare.com/v/UKOExjU3/file.html

Upvotes: 0

Views: 2774

Answers (2)

user3396583
user3396583

Reputation: 149

Using Stata 14.2 (14.1 should be the same) ...

    import delimited <yourfile>, delimiter(space, collapse) bindquote(strict) colrange(2) encoding(UTF-8) clear

    describe

Contains data
  obs:             6                          
 vars:             8                          
 size:         3,012                          
---------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
---------------------------------------------------------------------------------
project_id      byte    %8.0g                 
project_title   str38   %38s                  
project_owner~d str2    %9s                   
comment_user_~d str2    %9s                   
comment_date    str19   %19s                  
comment_censo~d byte    %8.0g                 
censored_reason byte    %8.0g                 
comment_content str438  %438s                 
---------------------------------------------------------------------------------
Sorted by: 
     Note: Dataset has changed since last saved.

Upvotes: 0

Parfait
Parfait

Reputation: 107587

Consider using Stata's insheet command, specifying the space delimiter. Do note: this is a coded solution to be used in .do file or command window and not via Stata's GUI menu per your screenshot:

clear
insheet using "path/to/data.txt", delimiter(" ")

Initial Output

Below uses your linked file and you will notice an unnamed v9 column outputs. And in dataset everything is shifted to the right. The reason being is that in R you are exporting the row.names which do not have column headers and so Stata reads in all named/unnamed columns. So, Stata's project_id aligns to R's row.names (1,2,3,4,5), Stata's project_title for R's project_id, and so on.

Contains data
  obs:             6                          
 vars:             9                          
 size:         1,878 (99.9% of memory free)
-----------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-----------------------------------------------------------------------------
project_id      byte   %8.0g                  
project_title   byte   %8.0g                  
project_owner~d str38  %38s                   
comment_user_~d str2   %9s                    
comment_date    str2   %9s                    
comment_censo~d str19  %19s                   
censored_reason byte   %8.0g                  
comment_content byte   %8.0g                  
v9              str244 %244s                  
-----------------------------------------------------------------------------
Sorted by:  
     Note:  dataset has changed since last saved

Resolution

To resolve above column misalignment, for your text output in R do not output row.names:

write.table(X, file = ("data.txt"), row.names = FALSE)

And if you do want to keep an ID, add it as a new, named column to include in export:

X$ID <- row.names(X)

Even better is using a consistent delimiter like comma or tab using the sep argument. In R, the command write.csv(...) is actually a convenient wrapper for write.table(..., sep = ","):

write.csv(X, file = ("data.txt"), row.names = FALSE)

write.table(X, file = ("data.txt"), row.names = FALSE, sep="\t")

Then in Stata, specify with insheet options being comma or tab:

insheet using "/path/to/data.txt", comma

insheet using "/path/to/data.txt", tab

Upvotes: 1

Related Questions