Felipe Alvarenga
Felipe Alvarenga

Reputation: 2652

Import csv which have numerics with comma as the decimal separator

I have a 5 gb csv file which takes over an hour to import to stata.

The reason is, among other things, it is in the European format, i.e., it is a ; separated file and numeric columns come with , as de decimal separator. For example:

V1  V2   V3
A   2,4  10,1
B   30   1,4

Problem is that stata assumes the numeric columns are string variables, thus, tries to import it in a really inefficient way (trying to declare the columns as numerics just give me missing values).

Is there a command/option where I input a different decimal-point separator, so the importing process goes faster?

Upvotes: 0

Views: 5127

Answers (2)

user3396583
user3396583

Reputation: 149

I think the functionality you are looking for has been added to Stata 16. The command import delimited has new options parselocale(), groupseparator(), and decimalseparator().

See 10/c from whatsnew15to16 below:

help whatsnew15to16
  1. import delimited is the existing command to import data from delimited text files. It has been enhanced.

    a. It is faster. It is 10% faster in general, and 2 to 4 times faster in some cases.

    b. It detects delimiters better. In addition to commas and tabs, it now detects pipes, colons, and semicolons.

    c. New options allow numeric parsing based on locale. The options are parselocale(), groupseparator(), and decimalseparator().

    d. Mismatched quotes in the imported file are reported so that you can fix them.

For the data:

A; 2,4; 10,1  
B; 30; 1,4

You could do the following:

import delimited "D:\data.csv", varnames(nonames) parselocale(es_ES)

or

import delimited "D:\data.csv", varnames(nonames) groupseparator(.) decimalseparator(,) 
list

     +-----------------+
     | v1    v2     v3 |
     |-----------------|
  1. |  A   2.4   10.1 |
  2. |  B    30    1.4 |
     +-----------------+

describe

    Contains data
      obs:             2                          
     vars:             3                          
    --------------------------------------------------------------------------------
                  storage   display    value
    variable name   type    format     label      variable label
    --------------------------------------------------------------------------------
    v1              str1    %9s                   
    v2              float   %8.0g                 
    v3              float   %9.0g   


Upvotes: 1

user8682794
user8682794

Reputation:

If your data file looks like this:

A; 2,4; 10,1  
B; 30; 1,4

You could do the following:

import delimited whatever_filename.txt, delimiters(";") varnames(nonames)
destring v2 v3, dpcomma replace

list

   +-----------------+
   | v1    v2     v3 |
   |-----------------|
1. |  A   2.4   10.1 |
2. |  B    30    1.4 |
   +-----------------+

As of Stata version 15, there isn't a way of doing this in one step. I think the only other solution is to pre-process the data file by changing the comma to a period. Excel could do this easily.

Upvotes: 2

Related Questions