Reputation: 13
i have a string variable named cat_d containing some observations with duplicated words in each observation. how can i remove the duplicated words for each observations? The following shows and image of the variable and data at the link variable cat_d
sample data each observations:
MPSJ,Hulu Langat,Hulu Langat, MPAJ, MPSJ, MPAJ, Gombak, MPSJ, MPSJ, MPSJ, MPKJ, MPAJ,MPAJ,Gombak,MPAJ,MPSJ,Hulu Langat,Gombak
Cheras,Cheras,Cheras,Setapak,Setapak,Setapak,Setapak,Pusat Bandar,Pusat Bandar,Klang Lama
Kuantan
MPJBT,MBJB,MBJB,MPPG,MBJB,MBJB,MBJB
expected output:
MPSJ,Hulu Langat,MPAJ,Gombak, MPKJ
Cheras,Setapak,Pusat Bandar,Klang Lama
Kuantan
MPJBT,MBJB,MPPG
data keep;
i=2;
length word $500;
do until (last.cat_d);
set want;
by cat_d notsorted;
string=cat_d;
do while(scan(string, i, ',') ^= '');
word = scan(string, i, ',');
do j = 1 to i - 1;
if word = scan(string, j, ',') then do;
start = findw(string, word, ',', findw(string, word, ',', 't') + 1, 't');
string = cat(substr(string, 1, start - 2), substr(string, start + length(word)));
leave;
end;
end;
i = i + 1;
end;
end;
keep cat_d string;run;
Upvotes: 1
Views: 956
Reputation: 21264
If you want the approach above to work you should try TRANWRD to remove the words, but you also have to deal with the commas and make sure to remove them if necessary. What happens to the last one that doesn't have a comma after it as well?
Here's an entirely different approach but its more flexible in my opinion.
Transpose it back to a wide data set and recreate the sentence.
*Create sample data;
data have;
length x $200.;
x="MPSJ,Hulu Langat,Hulu Langat, MPAJ, MPSJ, MPAJ, Gombak, MPSJ, MPSJ, MPSJ, MPKJ, MPAJ,MPAJ,Gombak,MPAJ,MPSJ,Hulu Langat,Gombak";
output;
x="Cheras,Cheras,Cheras,Setapak,Setapak,Setapak,Setapak,Pusat Bandar,Pusat Bandar,Klang Lama";
output;
x="Kuantan";
output;
x="MPJBT,MBJB,MBJB,MPPG,MBJB,MBJB,MBJB";
output;
run;
*Make it into a long dataset;
data long;
set have;
nwords=countw(x);
ID=_n_;
do i=1 to nwords;
words=scan(x, i);
output;
end;
run;
*Sort and remove duplicate values;
proc sort data=long nodupkey out=long_unique;
by ID words;
run;
*Transpose to a wide format;
proc transpose data=long_unique out=wide_unique prefix=word;
by id;
var words;
run;
*Make it back into one variable;
data want;
set wide_unique;
by id;
sentence=catx(", ", of word:);
run;
Upvotes: 2