desmond.carros
desmond.carros

Reputation: 362

How to eliminate duplicate entries from SAS dataset?

There is a sample dataset provided to me in CSV format. the dummy dataset is as follows:

Baseball1,Baseball2
USA,France
USA,Italy
USA,England
England,USA
England,Australia
England,Sri Lanka
France,USA
France,England
France,Italy
Italy,USA
Italy,France
Italy,England

i need to get an output data in which the data has descrete values only. desired output is like:

Baseball1   Baseball2
  USA        France
  USA        Italy
  USA       England
 England    Australia
 England    Sri Lanka
 France     England
 France      Italy
 Italy      England

i think that PROC SQL might work here but i am not sure of how to remove duplicate entries of different columns.

Upvotes: 1

Views: 584

Answers (1)

Vasilij Nevlev
Vasilij Nevlev

Reputation: 1449

I think the tricky part is that you care about the horizontal order of variables, so for you France/Italy and Italy/France actually forms a duplicate that you want to remove.

See my code below with notes on what it does:

/* Reading data in */
data have; 
length baseball1 $ 9 baseball2 $ 9; 
infile datalines delimiter=',';
input Baseball1 $ Baseball2 $ ; 
datalines; 
USA,France
USA,Italy
USA,England
England,USA
England,Australia
England,Sri Lanka
France,USA
France,England
France,Italy
Italy,USA
Italy,France
Italy,England
;

/* horizontal sorting */
data sorted_arrays; 
set have; 
length Team1 $ 9 Team2 $ 9; 
/* Copying data into new vars to preserve original data for output */
Team1 = Baseball1;
Team2 = Baseball2;
/* Sorting data horizontally with sortc call */
call sortc(Team1,Team2); 
/* Creating an ID by concatenating sorted variables */ 
ID = (CATX("/",Team1,Team2)); 
/* Preserving original order */ 
order = _N_; 
run; 

/* Removing duplicates by ID and keeping required variables*/ 
PROC SORT data=sorted_arrays out=no_dupes(keep=baseball1 baseball2 order) NODUPKEY; 
BY ID; 
RUN;

/* Returning to original order to achieve the result needed */
PROC SORT data=no_dupes out=want(drop=order); 
by order; 
run;  

/* Final Report*/
PROC PRINT data=want;
RUN; 

Result:

enter image description here

If final horizontal/vertical order of variables doesn't matter, you can simplify code as follows and you can use PROC SQL:

/* Reading data in */
data have; 
length baseball1 $ 9 baseball2 $ 9; 
infile datalines delimiter=',';
input Baseball1 $ Baseball2 $ ; 
/* horizontal sorting */
call sortc(Baseball1,Baseball2); 
datalines; 
USA,France
USA,Italy
USA,England
England,USA
England,Australia
England,Sri Lanka
France,USA
France,England
France,Italy
Italy,USA
Italy,France
Italy,England
;

/*Remove dupes */
PROC SQL;
   CREATE TABLE want AS 
   SELECT DISTINCT t1.baseball1, 
          t1.baseball2
      FROM WORK.HAVE t1;
QUIT;


/* Final Report*/
PROC PRINT data=want;
RUN;

Result:

enter image description here

/* Reading data in */
data have (drop=tmp);
    length baseball1 $ 9 baseball2 $ 9 tmp $9;
    infile datalines delimiter=',';
    input Baseball1 $ Baseball2 $;

    /* horizontal sorting */
    if Baseball1>Baseball2 then
        do;
            tmp = Baseball1;
            Baseball1=Baseball2;
            Baseball2 = tmp;
        end;

    datalines;
USA,France
USA,Italy
USA,England
England,USA
England,Australia
England,Sri Lanka
France,USA
France,England
France,Italy
Italy,USA
Italy,France
Italy,England
;

/*Remove dupes */
PROC SQL;
    CREATE TABLE want AS 
        SELECT DISTINCT t1.baseball1, 
            t1.baseball2
        FROM WORK.HAVE t1;
QUIT;

/* Final Report*/
PROC PRINT data=want;
RUN;

Same result as in previous example:

enter image description here

Upvotes: 2

Related Questions