dada333
dada333

Reputation: 13

Find the first row where the column header contains specific keywords

I have multiple Excel files saved in a folder, each containing only one sheet with randomly named sheet tabs. I need to find the first row that contains the column header "Job No." and import all the data starting from that column header for each file. There are 11 columns and random rows in each files.

Below is my code for processing a single Excel file to identify the row containing "Job No." However, I'm encountering an error on the line do i = 1 to dim(cols); that says:

ERROR: A loop variable cannot be an array name or a character variable; It must be a scalar numeric.

Can someone help me correct the code and make it applicable for multiple files? Thanks!

%let folder = /mnt;
%let filename = file.xlsx; 

proc import datafile="&folder./&filename."
            out=raw_data
            dbms=xlsx
            replace;
    getnames=no; 
run;

data _null_;
    set raw_data;
    array cols _all_; 
    if _n_ = 1 then do; 
        found = 0; 
        do i = 1 to dim(cols);
            if upcase(vname(cols[i])) = "JOB NO." then do; 
                call symputx('header_row', _n_); 
                found = 1; 
                leave; 
            end;
        end;
        if found = 0 then do;
            put "ERROR: 'Job No.' not found in the first row.";
            call symputx('header_row', 0); 
        end;
    end;
run;

data final_data;
    set raw_data(firstobs=&header_row);
run;

data final_data;
    set final_data;
    if _n_ = 1 then do;
        rename 
            col1 = Job_No
            col2 = Requesting_Client
            col3 = Contact_Full_Name
            col4 = Start_time
            col5 = End_Time
            col6 = Language
            /* there are more columns */
            ;
    end;
run;

proc print data=final_data;
run;

There could be some screenshots and other information that is not needed. I'd like to import column headers and data only.

Upvotes: 0

Views: 77

Answers (3)

Richard
Richard

Reputation: 27498

The error is likely because _all_ encompasses variables of both character and numeric type. Because you are looking for a character string match, make your array of only the character variables.

  array cols _character_ ;

Presuming the column names are arbitrary you can set up the rename statement as a codegen. Note that rename statement is unconditional and not conditional as you show.

  proc contents noprint data=raw out=contents;
  run ;
  %let JobNo_Colname = ;
  data _null_ ;
    set raw ;
    array cols _character_ ;
    do i = 1 to dim(cols) ;
      if cols(i) ne 'JOB NO.' then continue ;
      call symputx ('JobNo_Colname', vname(cols(i))) ;
      call symputx ('firstobs_data', _n_+1) ;
      stop ;
    end ;
  run ;
  proc sort data=contents ; by varnum ; run ;

  %let rename = ;
  data _null_ ;
    set contents end=end ;
    * Note: different code is needed if you want to capture the column names
    * from raw ;
    array newnames (11) $32 _temporary_ (
      'Job_No' 'Requesting_Client' 'Contact_Full_Name' 'Start_time'
      'End_Time' 'Language' etc...
    )
    retain i ;
    if name = "&JobNo_Colname" then i = 0 ;

    if missing (i) then return ;

    i+1;
    length rename $32000 ; retain rename;

    rename = catx(' ', rename, catx('=', name = newnames(i))) ;

    if i = 11 or end then do ;
      call symputx ('rename', &rename) ;
      stop ;
    end ; 
  run ;
  data want ;
    set raw (firstobs=&firstobs_data) ;
    rename &rename ;
  run ;

This will cover one raw. If you have multiple files, each one would be appended to a 'result' data set

proc append base=result append=want ;

Different raw files might not have the same data structures when imported and renamed and thus not be appendable.

Upvotes: 1

Tom
Tom

Reputation: 51566

First tell PROC IMPORT to read the file without using the first row to guess how to name the variables.

proc import dbms=xlsx
  datafile="&folder./&filename."
  out=raw_data replace
;
  getnames=no; 
run;

Now read in the first column, which will be named A, and look for the string you are expecting and store the observation number into a macro variable. Remember to preset the value to 0 in case it is not found.

%let row=0;
data _null_;
  set raw_data (keep=A) ;
  if lowcase(A) = 'job no.' then do;
    call symputx('row',_n_);
    stop;
  end;
run;

Now if the row was found you can use it to set the RANGE used to read the actual data from the file.

%if &row %then %do;
   /* code to read the file, see below */
%end;

You could ask IMPORT to use the actual column headers.

proc import dbms=xlsx
  datafile="&folder./&filename."
  out=WANT replace
;
  range="$A&row:"; 
run;

Or you could ask it to read without guessing the names (in which case start from the next row). You can use the RENAME= dataset option to give them your desired names.

proc import dbms=xlsx
  datafile="&folder./&filename."
  out=WANT(rename=(
        A = Job_No
        B = Requesting_Client
        C = Contact_Full_Name
        D = Start_time
        E = End_Time
        F = Language
     /* fill in the rest */
    ))
  replace
;
  range="$A%eval(&row+1):";
  getnames=no; 
run;

Upvotes: 0

Rud Faden
Rud Faden

Reputation: 411

_all_ is not valid in sas arrays. You need to loop over chars and numbers indenpendently.

data new_dataset;
    set existing_dataset;
    
    /* Define an array that includes all numeric variables */
    array num_vars _numeric_;

    /* Optionally, define an array for character variables */
    array char_vars _character_;
    
    /* Example operation: increment all numeric variables by 1 */
    do i = 1 to dim(num_vars);
        num_vars[i] = num_vars[i] + 1;
    end;
    
    /* Example operation: concatenate '-updated' to all character variables */
    do i = 1 to dim(char_vars);
        char_vars[i] = catx(' ', char_vars[i], '-updated');
    end;

    drop i; /* Remove the index variable from the output dataset */
run;

Upvotes: 0

Related Questions