hossibley
hossibley

Reputation: 253

SAS PROC SQL Group by all variables to detect duplicates

I have a dataset that contains 250 variables. I think some rows may be exact duplicates. If I only had 3 variables, I could run this code to check for dupes:

proc sql;
  create table checkDupe as
  select count(*) as N, *
  from bigTable
  group by 1, 2, 3
  having N > 1;
quit;

However, with 250 variables, I don't want to type out group by 1, 2, 3, ... , 250.

The following group by statements don't work:

group by *

group by _ALL_

group by 1:250

Is there a concise way to group by all variables?

Upvotes: 1

Views: 7274

Answers (3)

Quentin
Quentin

Reputation: 6378

Your sample code checking for duplicates returns a dataset that has all of the records where the key is NOT unique. So the below returns four records:

data bigtable;
   input a b c d;
   datalines;
1 1 1 1
1 1 1 1
1 2 1 1
1 2 2 1
1 2 2 2
;
run;

proc sql;
  create table checkDupe1 as
  select *,count(*) as N
  from bigTable
  group by 1, 2, 3
  having N > 1;
quit;

Assuming you don't need the N column, they easiest way in SAS to get a dataset of all non-unique rows is to use PROC SORT with the NOUNIQUEKEY option. So below will return the same four records:

proc sort data=bigtable out=checkDupe2 nouniquekey ;
  by a b c ;
run ;

And the BY statement can be BY _ALL_, which will return the two records that are duplicates on all variables:

proc sort data=bigtable out=checkDupe3 nouniquekey ;
  by _all_ ;
run ;

Upvotes: 1

Len Greski
Len Greski

Reputation: 10855

If the desired output is each unique row from a SAS data set, no grouping is required. The easiest way to execute this in PROC SQL is with the distinct keyword.

For example:

data one;
   input a b c d;
   datalines;
   1 1 1 1
   1 1 1 1
   1 2 1 1
   1 2 3 4
   1 2 1 1
   ;
   run;
proc sql;
   select distinct * from one;
   run;

...produces the following output, where the two duplicate rows are eliminated.

enter image description here

Upvotes: 0

Shenglin Chen
Shenglin Chen

Reputation: 4554

Try to produce macro variable for column names of dataset, then group in sql;

proc sql;
   select name into: name separate ',' from dictionary.columns where libname='SASHELP' and memname='CLASS';
quit;

proc sql;
   select count(*) as N,* from sashelp.class group by &name;
quit;

If just want to know whether there are complete duplicates in dataset, try this:

proc sort data=sashelp.class out=want nodup;
by _all_;
run;

Upvotes: 3

Related Questions