Reputation: 331
I want to read the first and last record of a huge number of .csv files (several gigabytes) stored in one folder on a Linux machine. Suppose they are called have1.csv, have2.csv, ...
and so on.
So I tried the following code, which only gives me the first line. But not the last line.
%let datapath = ~/somefolder/;
data want;
length finame $300.;
/*Reference all CSV files in input data folder*/
infile "&datapath.have*.csv" delimiter=","
MISSOVER DSD lrecl=32767 firstobs=2
eov=eov eof=eof filename=finame end=done;
/*Define input format of variables*/
informat Var1 COMMA. Var2 COMMA. Var3 COMMA.;
/*Loop over files*/
do while(not done);
/*Set trailing @ to hold the input open for the next input statement
this is because we have several files */
input @;
/*If first line in file is encountered eov is set to 1,
however, we have firstobs=2, hence all lines would be skipped.
So we need to reset EOV to 0.*/
if eov then
do;
/*Additional empty input statement
handles missing value at first loop*/
input;
eov = 2;
end;
/*First observation*/
if eov=2 then do;
input Var1--Var3;
fname=finame;
output;
eov = 0;
end;
/*Last observation*/
if 0 then do;
eof: input Var1--Var3;
fname=finame;
output;
end;
input;
end;
stop;
run;
I would very much appreciate your help! If I have misunderstood the concept or the interplay of infile, end, eov, eof and input @, please tell me! I do not know where my mistake is...
Upvotes: 0
Views: 1698
Reputation: 51611
If you have the list of files then the code is clearer. For example if you can use PIPE engine you could use ls (or Dir) command to get filenames. Then use FILEVAR= option to dynamically read each individual file.
data want ;
infile 'ls ~/test/dir1/file*' pipe truncover ;
input fname $200.;
filename=fname;
infile csv filevar=fname dsd truncover firstobs=2 end=eof ;
do _n_=1 by 1 while (not eof);
input str :$30. ;
if _N_=1 or eof then output;
end;
run;
Or if your files are large you could take advantage of using PIPE to use head
and tail
commands to find the beginning and ending of each file without needing to have SAS read the whole file. You would probably need to test to see if it actually improved the performance.
data want ;
infile 'ls ~/test/dir1/file*' pipe truncover ;
input filename $200.;
length cmd1 cmd2 $200 ;
cmd1='head -2 '||filename ;
infile top pipe filevar=cmd1 dsd truncover firstobs=2 end=eof1 ;
if (not eof1) then do;
input str :$30. ;
output;
end;
cmd2='tail -1 '||filename ;
infile bottom pipe filevar=cmd2 dsd truncover firstobs=1 end=eof2;
if (not eof2) then do;
input str :$30. ;
output;
end;
run;
Upvotes: 1
Reputation: 51611
If you want to use wildcards in the INFILE statement you can use the EOV= option to create a variable that will flag when a new file starts. Note that you need to manually reset the EOV flag.
Read and hold the line before reading the values so that you can test if a new file has started. That way you can output the last line from the previous file. You will also need to RETAIN your input variables so that the values from the last line of the previous file are available.
You also need to use the END= option to be able to output the last line of the last file.
Example:
data want ;
retain filename str;
length fname filename $200 ;
infile '/dir1/file*' filename=fname eov=eov end=eof truncover ;
input @;
if eov then output;
filename=fname ;
input str $30. ;
if _n_=1 or eov or eof then output;
eov=0;
run;
Example output:
Obs filename str
1 /dir1/file1 Line1
2 /dir1/file1 Line3
3 /dir1/file2 Line1
4 /dir1/file2 line4
5 /dir1/file3 Line1
6 /dir1/file3 Line3
If you want to skip first line of each file (the header line) add this statement right after the input @;
statement.
if _n_=1 or eov then input;
Note you will need to adjust the logic if it is possible that your input files do not all have at least two data lines (three lines counting header line).
Upvotes: 2
Reputation: 4792
This seems to work for me, please try it:
data want;
length finame $300.;
/*Reference all CSV files in input data folder*/
infile "E:\temp\test\have*.txt" delimiter=","
MISSOVER DSD lrecl=32767
eov=eov filename=finame end=done;
/* Note: firstobs option seems to work on first file only */
/*Define input format of variables*/
informat Var1 COMMA. Var2 COMMA. Var3 COMMA.;
input; /* skip header in first file */
input Var1--Var3; /* read first real record in first file */
fname=finame;
output;
/* Loop over files*/
do while(not done);
input @;/* try input do determine eov condition */
if eov then do;/* new file detected - we're on header record, but variables contain values from previous record - see "read values" */
output; /* variables contain values from previous record - output those values */
input; /* skip header */
eov = 0;
input Var1--Var3; /* read first real observation */
fname=finame;
output; /* first line of new file */
end;
input Var1--Var3; /* read values - it might be last record */
end;
output; /* output last record of last file */
run;
Actually, as Tom below describes, there's no need for while loop (dangerous thing :-) ). I've modified the code now: (need to add RETAINs because we're looping in data step itself)
data want;
length finame $300.;
/*Reference all CSV files in input data folder*/
infile "E:\temp\test\have*.txt" delimiter=","
MISSOVER DSD lrecl=32767
eov=eov filename=finame end=done;
informat Var1 COMMA. Var2 COMMA. Var3 COMMA.;
retain Var1 Var2 Var3 fname;
if _N_ = 1 then do; /* first file */
input; /* skip header in first file */
input Var1--Var3; /* read first real record in first file */
fname=finame;
output;
end;
input @; /* try input do determine eov condition */
if eov then do; /* new file detected - we've moved past header record, but variables contain values from previous record - see "read values" */
output; /* variables contain values from previous record - output those values */
input; /* skip header */
eov = 0;
input Var1--Var3; /* read first real observation */
fname=finame;
output; /* first line of new file */
end;
else input Var1--Var3;
if done then output;
run;
Upvotes: 1