Reputation: 4534
In refactoring some code, I am trying to replicate various data sets. I am confirming the refactors in the following manner. I import the live/original data set, sort it by _all_
. I sort the refactored data set by _all_
. I then compare the two.
%macro Validate(dataset);
data live_&dataset.;
set inFinal.&dataset.;
run;
proc sort data = live_&dataset. out = validation_original;
by _all_;
run;
proc sort data = &dataset. out = validation_refactor;
by _all_;
run;
proc compare error note
base = validation_original
compare = validation_refactor
;
run;
%mend;
This works but for one bizarre case. The data set contains a begin_date
and end_date
. PROC COMPARE
throws an error on the these values. When I investigate, it happens that there exist two rows containing identical values but for begin_date
and end_date
. The original dataset is not sorted correctly, despite having been run through PROC SORT
.
Refactored Dataset
|-----+-----------+-------------+------+------------+------------+-------+-------+-------|
| id | numerator | denominator | rate | begin_date | end_date | junk1 | junk2 | junk3 |
|-----+-----------+-------------+------+------------+------------+-------+-------+-------|
| 123 | 0 | 0.41504 | 0 | 10/01/2015 | 10/31/2015 | ABC | XYZ | IOU |
|-----+-----------+-------------+------+------------+------------+-------+-------+-------|
| 123 | 0 | 0.41504 | 0 | 11/01/2015 | 11/30/2015 | ABC | XYZ | IOU |
|-----+-----------+-------------+------+------------+------------+-------+-------+-------|
Original Dataset
|-----+-----------+-------------+------+------------+------------+-------+-------+-------|
| id | numerator | denominator | rate | begin_date | end_date | junk1 | junk2 | junk3 |
|-----+-----------+-------------+------+------------+------------+-------+-------+-------|
| 123 | 0 | 0.41504 | 0 | 11/01/2015 | 11/30/2015 | ABC | XYZ | IOU |
|-----+-----------+-------------+------+------------+------------+-------+-------+-------|
| 123 | 0 | 0.41504 | 0 | 10/01/2015 | 10/31/2015 | ABC | XYZ | IOU |
|-----+-----------+-------------+------+------------+------------+-------+-------+-------|
I cannot account for why this is. The mis-sort only happens with this particular data set. When I examine the contents via PROC CONTENTS
they are identical: they have the same number of observations, the same type, the same len, the same format, the same sort order, the same encoding. Everything appears to be identical.
PROC SORT
?Upvotes: 1
Views: 236
Reputation: 4534
The data set is not sorting as expected because of "denominator". When the sort is performed by all variables except "denominator", the data sets sort as expected and compare as identical.
proc sort data = live_&dataset. out = validation_original;
by
id
numerator
rate
begin_date
end_date
junk1
junk2
junk3
;
run;
The method of calculating "denominator" differs between the original and refactor. This likely introduces a slight bit of representation error.
To implement the validation, I did the following. I rearranged the refactored data sets so that all calculated columns appeared last. PROC COMPARE
cares only about the order of the rows and not the columns. So, prior to sorting the original data set, I got a list of the variables in the refactor and sorted the original data set using that. This approach generalizes across all data sets, regardless of what the columns actually are.
%macro Validate(dataset);
**********************************
*** Import
**********************************;
data live_&dataset.;
set inFinal.&dataset.;
run;
**********************************
*** Arrange for PROC COMPARE
**********************************;
proc sort data = &dataset. out = validation_refactor;
by _all_;
run;
proc sql noprint;
select name
into : refactorVariableOrder
separated by ' '
from dictionary.columns
where libname = 'WORK'
and memname = "%upcase(&dataset.)"
;
quit;
proc sort data = live_&dataset. out = validation_original;
by &refactorVariableOrder.;
run;
**********************************
*** Validate
**********************************;
proc compare error note
base = validation_original
compare = validation_refactor
;
run;
%mend;
Upvotes: 2