EagerToLearn
EagerToLearn

Reputation: 675

CSV data got truncated by SAS

I'm using SAS University Edition 9.4

This is my CSV data.

     ,MGAAAAAAAA,3,A0000B   2F1
11111,ハアン12222234222B56122,4,AA  0000
     ,テストデータ,5,AACHY 2410F1
     ,テストデタテストテ,5,AACHYF2

This is my infile statement.

data wk01;
 infile '/folders/myfolders/data/test_csv.txt'
 dsd delimiter=','   
 lrecl=1000 missover firstobs=1;
 input firstcol  :$  secondcol    :$ thirdcol    :$ therest    :$;
run ;

I expected my result like this.

enter image description here

But after executing SAS, What I got is as below (the yellow highlight indicates which row/column have its data being truncated by SAS)

enter image description here

For example, the first row's second column is MGAAAAAAAA but SAS's outut is MGAAAAAA

Could you please point out what am I missing here? Thanks alot.

Upvotes: 0

Views: 1558

Answers (2)

Tom
Tom

Reputation: 51566

The values of your variables are longer than the 8 bytes you are allowing for them. The UTF-8 characters can use up to 4 bytes each. Looks like some of them are getting truncated in the middle, so you get an invalid UTF-8 code.

Just define longer lengths for your variables instead of letting SAS use the default length of 8. In general it is best to explicitly define your variables with a LENGTH or ATTRIB statement. Instead of forcing SAS to guess how to define them based on how you first use them in other statements like INPUT, FORMAT, INFORMAT or assignment.

data wk01;
  infile '/folders/myfolders/data/test_csv.txt' dsd dlm=',' truncover ;
  length firstcol $8 secondcol $30 thirdcol $30 therest $100;
  input firstcol secondcol thirdcol therest;
run ;

Upvotes: 3

SAS2Python
SAS2Python

Reputation: 1297

I think what you have is a mixed encoding problem. What's essentially happening is that after the first 5 characters which are in ASCII, it changes to UTF8. The commas are getting mixed up in this soup and your standard delimiter gets a bit confused here. You need some manual coding like this to deal with it I think :

data wk01;
  infile "test.csv" lrecl=1000 truncover firstobs=1;
  input text $utf8x70.;

  firstcomma =  findc(text,',', 1);
  secondcomma =  findc(text,',', firstcomma + 1);
  thirdcomma =  findc(text,',', secondcomma + 1);
  fourthcomma =  findc(text,',', thirdcomma + 1);

  length firstcol $5;
  length secondcol $30;
  length thirdcol $1;
  length fourthcol $30;

  firstcol= substr(text,1, firstcomma - 1);
  secondcol = substr(text, firstcomma + 1, (secondcomma -firstcomma-1 ));
  thirdcol = substr(text, secondcomma + 1, (thirdcomma - secondcomma - 1));
  fourthcol = substr(text, thirdcomma + 1);
run;

Probably there is a cleaner way to do it, but this is the quick and dirty method I could come up at 2 AM :)

Upvotes: 1

Related Questions