Reputation: 675
I'm using SAS University Edition 9.4
This is my CSV data.
,MGAAAAAAAA,3,A0000B 2F1
11111,ハアン12222234222B56122,4,AA 0000
,テストデータ,5,AACHY 2410F1
,テストデタテストテ,5,AACHYF2
This is my infile statement.
data wk01;
infile '/folders/myfolders/data/test_csv.txt'
dsd delimiter=','
lrecl=1000 missover firstobs=1;
input firstcol :$ secondcol :$ thirdcol :$ therest :$;
run ;
I expected my result like this.
But after executing SAS, What I got is as below (the yellow highlight indicates which row/column have its data being truncated by SAS)
For example, the first row's second column is MGAAAAAAAA
but SAS's outut is MGAAAAAA
Could you please point out what am I missing here? Thanks alot.
Upvotes: 0
Views: 1558
Reputation: 51566
The values of your variables are longer than the 8 bytes you are allowing for them. The UTF-8 characters can use up to 4 bytes each. Looks like some of them are getting truncated in the middle, so you get an invalid UTF-8 code.
Just define longer lengths for your variables instead of letting SAS use the default length of 8. In general it is best to explicitly define your variables with a LENGTH or ATTRIB statement. Instead of forcing SAS to guess how to define them based on how you first use them in other statements like INPUT, FORMAT, INFORMAT or assignment.
data wk01;
infile '/folders/myfolders/data/test_csv.txt' dsd dlm=',' truncover ;
length firstcol $8 secondcol $30 thirdcol $30 therest $100;
input firstcol secondcol thirdcol therest;
run ;
Upvotes: 3
Reputation: 1297
I think what you have is a mixed encoding problem. What's essentially happening is that after the first 5 characters which are in ASCII, it changes to UTF8. The commas are getting mixed up in this soup and your standard delimiter gets a bit confused here. You need some manual coding like this to deal with it I think :
data wk01;
infile "test.csv" lrecl=1000 truncover firstobs=1;
input text $utf8x70.;
firstcomma = findc(text,',', 1);
secondcomma = findc(text,',', firstcomma + 1);
thirdcomma = findc(text,',', secondcomma + 1);
fourthcomma = findc(text,',', thirdcomma + 1);
length firstcol $5;
length secondcol $30;
length thirdcol $1;
length fourthcol $30;
firstcol= substr(text,1, firstcomma - 1);
secondcol = substr(text, firstcomma + 1, (secondcomma -firstcomma-1 ));
thirdcol = substr(text, secondcomma + 1, (thirdcomma - secondcomma - 1));
fourthcol = substr(text, thirdcomma + 1);
run;
Probably there is a cleaner way to do it, but this is the quick and dirty method I could come up at 2 AM :)
Upvotes: 1