Christian Wilkie

Reputation: 3823

Importing CSV in SAS with long varying length fields wastes disk space

The problem I am having is that when I import my CSV file, I end up with temporary files of huge size (much larger than the input file itself). When I use COMPRESS on some of these files I get up to a 99% reduction in space, but I'm wondering if there is a better way to reduce the size directly. I have prepared a simplified example. I'm importing a CSV like the following:

FILENAME CSVFILE("c:\example.csv"); 
DATA CSVINPUT (COMPRESS=yes REUSE=yes);
 INFILE CSVFILE DSD DLM ="," lrecl=15000;
 INFORMAT multiplier 3.0;
 INFORMAT A $char1000. ;
 INFORMAT B $char2000. ;
 INFORMAT C $char2000. ;
 INFORMAT D $char1000. ;
 INFORMAT E $char5000. ;
 INFORMAT F $char4000. ;
 INPUT multiplier
   A $
   B $
   C $
   D $
   E $
   F $ ;
RUN;

The CSV has data of varying widths: column A may have 5 characters, or it may have 1000 characters, and the same goes for the other columns. Also, some of these columns may have embedded whitespace. I'm not sure whether it is relevant to the problem, but multiplier determines the size of these columns. Here is an example of the CSV data file format:

4,aaaa,bbbbbbbb,    cc  ,dddd,eeeeeeeeeeeeeeeeeeee,ffffffffffff
1,a,bb,  ,d,eeeee,ffff
2,aa,bbbb,cc  ,eeeeeeeeee,ffffffff

Is there any way I can import this data more efficiently, without truncating any values (or blanks, for example column C)? This seems like it would be a common/simple problem to fix, but I haven't been able to find a working solution yet, so any help in the right direction would be appreciated!

Upvotes: 3

Views: 1871

Answers (2)

Chris J

Reputation: 7769

As you read in your file, you could dynamically determine the maximum length of each variable, then re-write your dataset with the variables defined at those maximum lengths.

But as Joe states, using COMPRESS=CHAR will essentially reduce disk consumption to the minimum needed to store the non-blank data. In comparison, using the dynamic method plus compression may only reduce it by a fraction more.

/* Pass 1: find the maximum observed length of each character variable */
data char_lengths ;
  set csvinput end=eof ;

  array n{*} A-F ;          /* A-F already have their lengths from CSVINPUT */
  array l{*} l1-l6 ;
  retain l1-l6 . ;          /* keep the running maxima across observations */

  do i = 1 to dim(n) ;
    l{i} = max(l{i},length(n{i})) ;
  end ;

  if eof then do ;
    do i = 1 to dim(n) ;
      call symputx(cats('LEN',i),l{i}) ;
      call symputx(cats('VAR',i),vname(n{i})) ;
    end ;
    output ;
  end ;
run ;

/* Generate the LENGTH statement entries: VAR_new $maxlen for each variable */
%MACRO VARLEN ;
  %DO I = 1 %TO 6 ;
    &&VAR&I.._new $&&LEN&I...
  %END ;
%MEND ;

/* Copy each original variable into its right-sized replacement, then drop it */
%MACRO VARASSIGN ;
  %DO I = 1 %TO 6 ;
    &&VAR&I.._new = &&VAR&I ;
    drop &&VAR&I ;
  %END ;
%MEND ;

/* Pass 2: re-create the dataset with right-sized variables, compressed */
data shrink (compress=char) ;
  length %VARLEN ;
  set csvinput ;

  %VARASSIGN ;
run ;
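
To check how much this actually saves, a quick PROC CONTENTS on the before/after datasets reports the observation length and the number of data set pages used (a simple check, using the dataset names above):

/* Compare storage before and after: observation length and pages used */
proc contents data=csvinput ;
run ;

proc contents data=shrink ;
run ;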

Upvotes: 1

Joe

Reputation: 63434

SAS is a fixed-width storage system; it doesn't have the concept of VARCHAR that most SQL implementations do. That means that in any given SAS dataset, a column has a fixed width (a fixed number of bytes of storage assigned to it in every observation). If you want a column (variable) to hold sometimes 1 or 2 characters and sometimes 1500, you have to assign 1500 to the column and accept the waste. This makes SAS faster at accessing data (each observation sits at a predictable offset) but wastes quite a lot of space if your data is structured like the above.
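
You can see the fixed-width behaviour directly: the observation length reported by PROC CONTENTS is simply the sum of the declared column widths, no matter how short the stored values are. A minimal sketch (FIXED_WIDTH is just a hypothetical dataset name):

/* Two character variables: 5 + 1500 bytes reserved in every observation */
data fixed_width ;
  length a $5 b $1500 ;
  a = 'x' ;
  b = 'y' ;
run ;

/* PROC CONTENTS reports an observation length of 1505 bytes,
   even though only two bytes hold real data */
proc contents data=fixed_width ;
run ;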

OPTIONS COMPRESS=CHAR is the correct way to deal with this, as it will then compress the data and remove the wasted space. Other than restructuring your data in some fashion to avoid having columns with large amounts of wasted space, that's your only real option.
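
For reference, COMPRESS=CHAR can be set once for the session or per dataset as a dataset option (a small sketch; WANT is a hypothetical output name):

/* Session-wide: every dataset created from this point on is compressed */
options compress=char ;

/* Or per dataset (COMPRESS=YES is an alias for CHAR) */
data want (compress=char) ;
  set csvinput ;
run ;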

Upvotes: 5
