Reputation: 362
There is a sample dataset provided to me in CSV format. the dummy dataset is as follows:
Baseball1,Baseball2
USA,France
USA,Italy
USA,England
England,USA
England,Australia
England,Sri Lanka
France,USA
France,England
France,Italy
Italy,USA
Italy,France
Italy,England
i need to get an output data in which the data has descrete values only. desired output is like:
Baseball1 Baseball2
USA France
USA Italy
USA England
England Australia
England Sri Lanka
France England
France Italy
Italy England
i think that PROC SQL might work here but i am not sure of how to remove duplicate entries of different columns.
Upvotes: 1
Views: 584
Reputation: 1449
I think the tricky part is that you care about the horizontal order of variables, so for you France/Italy and Italy/France actually forms a duplicate that you want to remove.
See my code below with notes on what it does:
/* Reading data in */
data have;
length baseball1 $ 9 baseball2 $ 9;
infile datalines delimiter=',';
input Baseball1 $ Baseball2 $ ;
datalines;
USA,France
USA,Italy
USA,England
England,USA
England,Australia
England,Sri Lanka
France,USA
France,England
France,Italy
Italy,USA
Italy,France
Italy,England
;
/* horizontal sorting */
data sorted_arrays;
set have;
length Team1 $ 9 Team2 $ 9;
/* Copying data into new vars to preserve original data for output */
Team1 = Baseball1;
Team2 = Baseball2;
/* Sorting data horizontally with sortc call */
call sortc(Team1,Team2);
/* Creating an ID by concatenating sorted variables */
ID = (CATX("/",Team1,Team2));
/* Preserving original order */
order = _N_;
run;
/* Removing duplicates by ID and keeping required variables*/
PROC SORT data=sorted_arrays out=no_dupes(keep=baseball1 baseball2 order) NODUPKEY;
BY ID;
RUN;
/* Returning to original order to achieve the result needed */
PROC SORT data=no_dupes out=want(drop=order);
by order;
run;
/* Final Report*/
PROC PRINT data=want;
RUN;
Result:
If final horizontal/vertical order of variables doesn't matter, you can simplify code as follows and you can use PROC SQL:
/* Reading data in */
data have;
length baseball1 $ 9 baseball2 $ 9;
infile datalines delimiter=',';
input Baseball1 $ Baseball2 $ ;
/* horizontal sorting */
call sortc(Baseball1,Baseball2);
datalines;
USA,France
USA,Italy
USA,England
England,USA
England,Australia
England,Sri Lanka
France,USA
France,England
France,Italy
Italy,USA
Italy,France
Italy,England
;
/*Remove dupes */
PROC SQL;
CREATE TABLE want AS
SELECT DISTINCT t1.baseball1,
t1.baseball2
FROM WORK.HAVE t1;
QUIT;
/* Final Report*/
PROC PRINT data=want;
RUN;
Result:
/* Reading data in */
data have (drop=tmp);
length baseball1 $ 9 baseball2 $ 9 tmp $9;
infile datalines delimiter=',';
input Baseball1 $ Baseball2 $;
/* horizontal sorting */
if Baseball1>Baseball2 then
do;
tmp = Baseball1;
Baseball1=Baseball2;
Baseball2 = tmp;
end;
datalines;
USA,France
USA,Italy
USA,England
England,USA
England,Australia
England,Sri Lanka
France,USA
France,England
France,Italy
Italy,USA
Italy,France
Italy,England
;
/*Remove dupes */
PROC SQL;
CREATE TABLE want AS
SELECT DISTINCT t1.baseball1,
t1.baseball2
FROM WORK.HAVE t1;
QUIT;
/* Final Report*/
PROC PRINT data=want;
RUN;
Same result as in previous example:
Upvotes: 2