Import csv which have numerics with comma as the decimal separator

Question

I have a 5 gb csv file which takes over an hour to import to stata.

The reason is, among other things, it is in the European format, i.e., it is a ; separated file and numeric columns come with , as de decimal separator. For example:

V1  V2   V3
A   2,4  10,1
B   30   1,4

Problem is that stata assumes the numeric columns are string variables, thus, tries to import it in a really inefficient way (trying to declare the columns as numerics just give me missing values).

Is there a command/option where I input a different decimal-point separator, so the importing process goes faster?

user3396583 · Accepted Answer

I think the functionality you are looking for has been added to Stata 16. The command import delimited has new options parselocale(), groupseparator(), and decimalseparator().

See 10/c from whatsnew15to16 below:

help whatsnew15to16

import delimited is the existing command to import data from delimited text files. It has been enhanced.

a. It is faster. It is 10% faster in general, and 2 to 4 times faster in some cases.

b. It detects delimiters better. In addition to commas and tabs, it now detects pipes, colons, and semicolons.

c. New options allow numeric parsing based on locale. The options are parselocale(), groupseparator(), and decimalseparator().

d. Mismatched quotes in the imported file are reported so that you can fix them.

For the data:

A; 2,4; 10,1  
B; 30; 1,4

You could do the following:

import delimited "D:\data.csv", varnames(nonames) parselocale(es_ES)

or

import delimited "D:\data.csv", varnames(nonames) groupseparator(.) decimalseparator(,)

list

     +-----------------+
     | v1    v2     v3 |
     |-----------------|
  1. |  A   2.4   10.1 |
  2. |  B    30    1.4 |
     +-----------------+

describe

    Contains data
      obs:             2                          
     vars:             3                          
    --------------------------------------------------------------------------------
                  storage   display    value
    variable name   type    format     label      variable label
    --------------------------------------------------------------------------------
    v1              str1    %9s                   
    v2              float   %8.0g                 
    v3              float   %9.0g

Import csv which have numerics with comma as the decimal separator

Answers (2)

Related Questions