Lorem Ipsum
Lorem Ipsum

Reputation: 4534

SAS - Dataset not sorting as expected

In refactoring some code, I am trying to replicate various data sets. I am confirming the refactors in the following manner. I import the live/original data set, sort it by _all_. I sort the refactored data set by _all_. I then compare the two.

%macro   Validate(dataset);

  data live_&dataset.;
    set inFinal.&dataset.;
  run;

  proc sort data = live_&dataset. out = validation_original;
    by _all_;
  run;

  proc sort data = &dataset. out = validation_refactor;
    by _all_;
  run;

  proc compare error note 
    base      = validation_original
    compare   = validation_refactor
    ;
  run;

%mend;

This works but for one bizarre case. The data set contains a begin_date and end_date. PROC COMPARE throws an error on the these values. When I investigate, it happens that there exist two rows containing identical values but for begin_date and end_date. The original dataset is not sorted correctly, despite having been run through PROC SORT.

Refactored Dataset
|-----+-----------+-------------+------+------------+------------+-------+-------+-------|
|  id | numerator | denominator | rate | begin_date | end_date   | junk1 | junk2 | junk3 |
|-----+-----------+-------------+------+------------+------------+-------+-------+-------|
| 123 |         0 |     0.41504 |    0 | 10/01/2015 | 10/31/2015 | ABC   | XYZ   | IOU   |
|-----+-----------+-------------+------+------------+------------+-------+-------+-------|
| 123 |         0 |     0.41504 |    0 | 11/01/2015 | 11/30/2015 | ABC   | XYZ   | IOU   |
|-----+-----------+-------------+------+------------+------------+-------+-------+-------|

Original Dataset
|-----+-----------+-------------+------+------------+------------+-------+-------+-------|
|  id | numerator | denominator | rate | begin_date | end_date   | junk1 | junk2 | junk3 |
|-----+-----------+-------------+------+------------+------------+-------+-------+-------|
| 123 |         0 |     0.41504 |    0 | 11/01/2015 | 11/30/2015 | ABC   | XYZ   | IOU   |
|-----+-----------+-------------+------+------------+------------+-------+-------+-------|
| 123 |         0 |     0.41504 |    0 | 10/01/2015 | 10/31/2015 | ABC   | XYZ   | IOU   |
|-----+-----------+-------------+------+------------+------------+-------+-------+-------|

I cannot account for why this is. The mis-sort only happens with this particular data set. When I examine the contents via PROC CONTENTS they are identical: they have the same number of observations, the same type, the same len, the same format, the same sort order, the same encoding. Everything appears to be identical.

Upvotes: 1

Views: 236

Answers (1)

Lorem Ipsum
Lorem Ipsum

Reputation: 4534

The data set is not sorting as expected because of "denominator". When the sort is performed by all variables except "denominator", the data sets sort as expected and compare as identical.

  proc sort data = live_&dataset. out = validation_original;
    by 
      id
      numerator
      rate
      begin_date
      end_date
      junk1
      junk2
      junk3
    ;
  run;

The method of calculating "denominator" differs between the original and refactor. This likely introduces a slight bit of representation error.

To implement the validation, I did the following. I rearranged the refactored data sets so that all calculated columns appeared last. PROC COMPARE cares only about the order of the rows and not the columns. So, prior to sorting the original data set, I got a list of the variables in the refactor and sorted the original data set using that. This approach generalizes across all data sets, regardless of what the columns actually are.

%macro   Validate(dataset);
**********************************
*** Import
**********************************;  
  data live_&dataset.;
    set inFinal.&dataset.;
  run;

**********************************
*** Arrange for PROC COMPARE
**********************************;
  proc sort data = &dataset. out = validation_refactor;
    by _all_;
  run;

  proc sql noprint;
    select name
    into : refactorVariableOrder
    separated by ' '
    from dictionary.columns
    where libname = 'WORK'
      and memname = "%upcase(&dataset.)"
    ;
  quit;

  proc sort data = live_&dataset. out = validation_original;
    by &refactorVariableOrder.;
  run;

**********************************
*** Validate
**********************************;
  proc compare error note 
    base      = validation_original
    compare   = validation_refactor
    ;
  run;
%mend;

Upvotes: 2

Related Questions