Reputation: 95
I am working with a very large dataset containing the same columns several times, but with different column names (both character and numeric). Does anyone know how to find and delete these identical columns?
Example
A B C D E F G
12 ab 12 ab 8 h 12
14 cd 14 cd 65 j 14
6 fs 6 fs 3 g 6
. . . . 4 q .
3 d 3 d 5 d 3
A-G
are variable names, and I want to be able to see that A
, C
and G
are identical and then remove all except one.
Also B
and D
are identical. I want to keep only one.
Is this even possible?
Upvotes: 4
Views: 1132
Reputation: 51566
Here is example using technique proposed by Shenglin Chen in the comments.
data have ;
input A B $ C D $ E F $ G ;
cards;
12 ab 12 ab 8 h 12
14 cd 14 cd 65 j 14
6 fs 6 fs 3 g 6
. . . . 4 q .
3 d 3 d 5 d 3
;;;;
Find the unique numeric columns.
proc transpose data=have out=tall_numbers ;
var _numeric_;
run;
proc sort data=tall_numbers nodupkey out=keep_numbers(keep=_name_);
by col: ;
run;
Find the unique character columns.
proc transpose data=have out=tall_characters ;
var _character_;
run;
proc sort data=tall_characters nodupkey out=keep_characters(keep=_name_);
by col: ;
run;
Get the combined list of columns.
proc sql noprint ;
select _name_
into :keep_list separated by ' '
from (select _name_ from keep_characters
union select _name_ from keep_numbers)
order by 1
;
quit;
Make new table with only the unique columns.
data want ;
set have ;
keep &keep_list ;
run;
Upvotes: 2