Christian Wilkie

Reputation: 3823

Importing CSV in SAS with long varying length fields wastes disk space

The problem I am having is that when I import my CSV file, I end up with temporary files of huge size (much larger than the input file itself). When I use COMPRESS on some of these files I get up to a 99% reduction in space, but I'm wondering if there is a better way to reduce the size directly. I have prepared a simplified example. I'm importing a CSV like the following:

FILENAME CSVFILE("c:\example.csv"); 
DATA CSVINPUT (COMPRESS=yes REUSE=yes);
 INFILE CSVFILE DSD DLM ="," lrecl=15000;
 INFORMAT multiplier 3.0;
 INFORMAT A $char1000. ;
 INFORMAT B $char2000. ;
 INFORMAT C $char2000. ;
 INFORMAT D $char1000. ;
 INFORMAT E $char5000. ;
 INFORMAT F $char4000. ;
 INPUT multiplier
   A $
   B $
   C $
   D $
   E $
   F $ ;
RUN;

The CSV has data of varying widths: column A may have 5 characters, or it may have 1000 characters, and the same goes for the other columns. Also, some of these columns may have embedded whitespace. I'm not sure whether it is relevant to the problem, but multiplier determines the size of these columns. Here is an example of the CSV data file format:

4,aaaa,bbbbbbbb,    cc  ,dddd,eeeeeeeeeeeeeeeeeeee,ffffffffffff
1,a,bb,  ,d,eeeee,ffff
2,aa,bbbb,cc  ,eeeeeeeeee,ffffffff

Is there any way I can import this data more efficiently, without truncating any values (or blanks, for example column C)? This seems like it would be a common/simple problem to fix, but I haven't been able to find a working solution yet, so any help in the right direction would be appreciated!

Upvotes: 3

Views: 1871

Answers (2)

Chris J

Reputation: 7769

As you read in your file, you could dynamically determine the maximum length of each variable, then re-write your dataset with the variables defined at those maximum lengths.

But as Joe states, using COMPRESS=CHAR will essentially reduce disk consumption to the minimum needed to store the non-blank data. In comparison, using the dynamic method plus compression may only reduce it by a fraction more.

/* Pass 1: find the maximum observed length of each character variable */
data char_lengths ;
  set csvinput end=eof ;

  array n{*} A-F ;          /* A-F already have their lengths from CSVINPUT */
  array l{*} l1-l6 ;
  retain l1-l6 . ;          /* keep the running maxima across observations */

  do i = 1 to dim(n) ;
    l{i} = max(l{i},length(n{i})) ;
  end ;

  if eof then do ;
    do i = 1 to dim(n) ;
      call symputx(cats('LEN',i),l{i}) ;
      call symputx(cats('VAR',i),vname(n{i})) ;
    end ;
    output ;
  end ;
run ;

/* Generate the LENGTH statement entries: VAR_new $maxlen for each variable */
%MACRO VARLEN ;
  %DO I = 1 %TO 6 ;
    &&VAR&I.._new $&&LEN&I...
  %END ;
%MEND ;

/* Copy each original variable into its right-sized replacement, then drop it */
%MACRO VARASSIGN ;
  %DO I = 1 %TO 6 ;
    &&VAR&I.._new = &&VAR&I ;
    drop &&VAR&I ;
  %END ;
%MEND ;

/* Pass 2: re-create the dataset with right-sized variables, compressed */
data shrink (compress=char) ;
  length %VARLEN ;
  set csvinput ;

  %VARASSIGN ;
run ;
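
To check how much this actually saves, a quick PROC CONTENTS on the before/after datasets reports the observation length and the number of data set pages used (a simple check, using the dataset names above):

/* Compare storage before and after: observation length and pages used */
proc contents data=csvinput ;
run ;

proc contents data=shrink ;
run ;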

Upvotes: 1

Joe

Reputation: 63434

SAS is a fixed-width storage system; it doesn't have the concept of VARCHAR that most SQL implementations do. That means that in any given SAS dataset, a column has a fixed width (a fixed number of bytes of storage assigned to it in every observation). If you want a column (variable) to hold sometimes 1 or 2 characters and sometimes 1500, you have to assign 1500 to the column and accept the waste. This makes SAS faster at accessing data (each observation sits at a predictable offset) but wastes quite a lot of space if your data is structured like the above.
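
You can see the fixed-width behaviour directly: the observation length reported by PROC CONTENTS is simply the sum of the declared column widths, no matter how short the stored values are. A minimal sketch (FIXED_WIDTH is just a hypothetical dataset name):

/* Two character variables: 5 + 1500 bytes reserved in every observation */
data fixed_width ;
  length a $5 b $1500 ;
  a = 'x' ;
  b = 'y' ;
run ;

/* PROC CONTENTS reports an observation length of 1505 bytes,
   even though only two bytes hold real data */
proc contents data=fixed_width ;
run ;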

OPTIONS COMPRESS=CHAR is the correct way to deal with this, as it will then compress the data and remove the wasted space. Other than restructuring your data in some fashion to avoid having columns with large amounts of wasted space, that's your only real option.
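
For reference, COMPRESS=CHAR can be set once for the session or per dataset as a dataset option (a small sketch; WANT is a hypothetical output name):

/* Session-wide: every dataset created from this point on is compressed */
options compress=char ;

/* Or per dataset (COMPRESS=YES is an alias for CHAR) */
data want (compress=char) ;
  set csvinput ;
run ;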

Upvotes: 5
