Reputation: 2652
I have a 5 gb csv
file which takes over an hour to import to stata.
The reason is, among other things, it is in the European format, i.e., it is a ;
separated file and numeric columns come with ,
as de decimal separator. For example:
V1 V2 V3
A 2,4 10,1
B 30 1,4
Problem is that stata
assumes the numeric columns are string variables, thus, tries to import it in a really inefficient way (trying to declare the columns as numerics just give me missing values).
Is there a command/option where I input a different decimal-point separator, so the importing process goes faster?
Upvotes: 0
Views: 5127
Reputation: 149
I think the functionality you are looking for has been added to Stata 16. The command import delimited
has new options parselocale()
, groupseparator()
, and decimalseparator()
.
See 10/c
from whatsnew15to16 below:
help whatsnew15to16
import delimited is the existing command to import data from delimited text files. It has been enhanced.
a. It is faster. It is 10% faster in general, and 2 to 4 times faster in some cases.
b. It detects delimiters better. In addition to commas and tabs, it now detects pipes, colons, and semicolons.
c. New options allow numeric parsing based on locale. The options are parselocale(), groupseparator(), and decimalseparator().
d. Mismatched quotes in the imported file are reported so that you can fix them.
For the data:
A; 2,4; 10,1
B; 30; 1,4
You could do the following:
import delimited "D:\data.csv", varnames(nonames) parselocale(es_ES)
or
import delimited "D:\data.csv", varnames(nonames) groupseparator(.) decimalseparator(,)
list
+-----------------+
| v1 v2 v3 |
|-----------------|
1. | A 2.4 10.1 |
2. | B 30 1.4 |
+-----------------+
describe
Contains data
obs: 2
vars: 3
--------------------------------------------------------------------------------
storage display value
variable name type format label variable label
--------------------------------------------------------------------------------
v1 str1 %9s
v2 float %8.0g
v3 float %9.0g
Upvotes: 1
Reputation:
If your data file looks like this:
A; 2,4; 10,1
B; 30; 1,4
You could do the following:
import delimited whatever_filename.txt, delimiters(";") varnames(nonames)
destring v2 v3, dpcomma replace
list
+-----------------+
| v1 v2 v3 |
|-----------------|
1. | A 2.4 10.1 |
2. | B 30 1.4 |
+-----------------+
As of Stata version 15, there isn't a way of doing this in one step. I think the only other solution is to pre-process the data file by changing the comma to a period. Excel could do this easily.
Upvotes: 2