JWW
JWW

Reputation: 13

SAS - Remove duplicated words in string for one whole variable

i have a string variable named cat_d containing some observations with duplicated words in each observation. how can i remove the duplicated words for each observations? The following shows and image of the variable and data at the link variable cat_d

sample data each observations:

MPSJ,Hulu Langat,Hulu Langat, MPAJ, MPSJ, MPAJ, Gombak, MPSJ, MPSJ, MPSJ, MPKJ, MPAJ,MPAJ,Gombak,MPAJ,MPSJ,Hulu Langat,Gombak

Cheras,Cheras,Cheras,Setapak,Setapak,Setapak,Setapak,Pusat Bandar,Pusat Bandar,Klang Lama

Kuantan

MPJBT,MBJB,MBJB,MPPG,MBJB,MBJB,MBJB

expected output:

MPSJ,Hulu Langat,MPAJ,Gombak, MPKJ

Cheras,Setapak,Pusat Bandar,Klang Lama

Kuantan

MPJBT,MBJB,MPPG

data keep;
i=2;
length word $500;
do until (last.cat_d);
    set want;
    by cat_d notsorted;
    string=cat_d;
    do while(scan(string, i, ',') ^= '');
        word = scan(string, i, ',');
        do j = 1 to i - 1;
            if word = scan(string, j, ',') then do;
                start = findw(string, word, ',', findw(string, word, ',', 't') + 1, 't');
                string = cat(substr(string, 1, start - 2), substr(string, start + length(word)));
                leave;
            end;
        end;
        i = i + 1;
    end;
end;
keep cat_d string;run;

Upvotes: 1

Views: 956

Answers (1)

Reeza
Reeza

Reputation: 21264

If you want the approach above to work you should try TRANWRD to remove the words, but you also have to deal with the commas and make sure to remove them if necessary. What happens to the last one that doesn't have a comma after it as well?

Here's an entirely different approach but its more flexible in my opinion.

  1. Count how many words in each variable
  2. Separate it out, so that each entry is on it's own line. In general, you may find this structure easier to work with overall.
  3. Sort and de-duplicate the data set
  4. Transpose it back to a wide data set and recreate the sentence.

    *Create sample data;
    
    data have;
        length x $200.;
        x="MPSJ,Hulu Langat,Hulu Langat, MPAJ, MPSJ, MPAJ, Gombak, MPSJ, MPSJ, MPSJ, MPKJ, MPAJ,MPAJ,Gombak,MPAJ,MPSJ,Hulu Langat,Gombak";
        output;
        x="Cheras,Cheras,Cheras,Setapak,Setapak,Setapak,Setapak,Pusat Bandar,Pusat Bandar,Klang Lama";
        output;
        x="Kuantan";
        output;
        x="MPJBT,MBJB,MBJB,MPPG,MBJB,MBJB,MBJB";
        output;
    run;
    
    *Make it into a long dataset;
    
    data long;
        set have;
        nwords=countw(x);
        ID=_n_;
    
        do i=1 to nwords;
            words=scan(x, i);
            output;
        end;
    run;
    
    *Sort and remove duplicate values;
    
    proc sort data=long nodupkey out=long_unique;
        by ID words;
    run;
    
    *Transpose to a wide format;
    
    proc transpose data=long_unique out=wide_unique prefix=word;
        by id;
        var words;
    run;
    
    *Make it back into one variable;
    
    data want;
        set wide_unique;
        by id;
        sentence=catx(", ", of word:);
    run;
    

Upvotes: 2

Related Questions