uwd
uwd

Reputation: 45

Is there a better way replace several words in a string with another word? SAS

I've got lots of data and one of the columns is a free text description. I'm trying to process this in SAS and as part of this I want to correct some spelling and remove some words that don't really add any value to what the text is saying (aka 'stopwords').

I've got a way of doing this which is the code shown below, and this is working fine, but it means I need to have a line of code for each and every variant of a word I want to change.

In the example below I want to:

  1. replace variants of the word "mug" (ie. "mug", "mugg", "mugs") with the work "cup"
  2. remove three words that aren't really adding anything to the meaning of the text (called "stopwords", here I've just listed 3: "i", "me", "my").

To get this working I've needed to actually replace words surrounded by spaces (ie. replace instances of " mug ", " mugg ", " mugs " rather than "mug", "mugg", "mugs"). This is to avoid replacing parts of others words that contain that string of letters. So I've had to remove punctuation and add a space to the start and end of the string of text, before doing the spell changes, which is fine.

I'm sure there must be a better way of doing this than the code below and I'm keen to improve my SAS, so does anyone know a better way of doing this. Is there a way of creating a new list comprised of "mug", "mugg", "mugs" and then doing the replacement of all these words with " cup " in a single line?

Any thoughts would be greatly appreciated :)

Code below:

data have;
  infile datalines dsd truncover;
  input ID Description :$50. Col3 $ Col4 Col5 Col6;
datalines;
1,bla bla my mybla,C1,0,100,0
2,got me tear,C1,0,0,0
3,free text i ,C1,10,100,0
4,house roof tree!?,C1,10,100,0
5,house mugg muggle,C1,10,0,0
6,sky** computer mug mug mugs!,C3,0,20,1
;
/* add a space to the start and end so every word is surounded by spaces */
data data_1;
set have;
Space = "_";
Description_new = catt(Space, Description, Space);
Description_new = tranwrd(Description_new,"_", " _ ");
run;

/* remove punctuation so every word is surounded by spaces */
data data_2;
set data_1;
Description_new = COMPRESS(Description_new,,'p');
drop Space;
run;

/* correct spelling of mug to cup*/
data data_3;
set data_2;
Description_new = tranwrd(Description_new," mug ", " cup ");
Description_new = tranwrd(Description_new," mugs ", " cup ");
Description_new = tranwrd(Description_new," mugg ", " cup ");
run;

/* remove stopwords */
data data_4;
set data_3;
Description_new = tranwrd(Description_new," i ", " ");
Description_new = tranwrd(Description_new," me ", " ");
Description_new = tranwrd(Description_new," my ", " ");
run;

Upvotes: 0

Views: 960

Answers (1)

Tom
Tom

Reputation: 51611

You could use a format to convert each word in the original variable.

data have;
  infile datalines dsd truncover;
  input ID Description :$50. Col3 $ Col4 Col5 Col6;
datalines;
1,bla bla my mybla,C1,0,100,0
2,got me tear,C1,0,0,0
3,free text i ,C1,10,100,0
4,house roof tree!?,C1,10,100,0
5,house mugg muggle,C1,10,0,0
6,sky** computer mug mug mugs!,C3,0,20,1
;

proc format ;
value $fix (max=200)
  "mug", "mugg", "mugs" = "cup"
  "i", "me", "my" = " "
;
run;

data want;
  set have;
  fixed=description;
  fixed=' ';
  do index=1 to countw(description,' ');
    fixed=catx(' ',fixed,put(scan(description,index,' '),$fix200.));
  end;
run;

Upvotes: 2

Related Questions