Reputation: 45
I've got lots of data and one of the columns is a free text description. I'm trying to process this in SAS and as part of this I want to correct some spelling and remove some words that don't really add any value to what the text is saying (aka 'stopwords').
I've got a way of doing this which is the code shown below, and this is working fine, but it means I need to have a line of code for each and every variant of a word I want to change.
In the example below I want to:
To get this working I've needed to actually replace words surrounded by spaces (ie. replace instances of " mug ", " mugg ", " mugs " rather than "mug", "mugg", "mugs"). This is to avoid replacing parts of others words that contain that string of letters. So I've had to remove punctuation and add a space to the start and end of the string of text, before doing the spell changes, which is fine.
I'm sure there must be a better way of doing this than the code below and I'm keen to improve my SAS, so does anyone know a better way of doing this. Is there a way of creating a new list comprised of "mug", "mugg", "mugs" and then doing the replacement of all these words with " cup " in a single line?
Any thoughts would be greatly appreciated :)
Code below:
data have;
infile datalines dsd truncover;
input ID Description :$50. Col3 $ Col4 Col5 Col6;
datalines;
1,bla bla my mybla,C1,0,100,0
2,got me tear,C1,0,0,0
3,free text i ,C1,10,100,0
4,house roof tree!?,C1,10,100,0
5,house mugg muggle,C1,10,0,0
6,sky** computer mug mug mugs!,C3,0,20,1
;
/* add a space to the start and end so every word is surounded by spaces */
data data_1;
set have;
Space = "_";
Description_new = catt(Space, Description, Space);
Description_new = tranwrd(Description_new,"_", " _ ");
run;
/* remove punctuation so every word is surounded by spaces */
data data_2;
set data_1;
Description_new = COMPRESS(Description_new,,'p');
drop Space;
run;
/* correct spelling of mug to cup*/
data data_3;
set data_2;
Description_new = tranwrd(Description_new," mug ", " cup ");
Description_new = tranwrd(Description_new," mugs ", " cup ");
Description_new = tranwrd(Description_new," mugg ", " cup ");
run;
/* remove stopwords */
data data_4;
set data_3;
Description_new = tranwrd(Description_new," i ", " ");
Description_new = tranwrd(Description_new," me ", " ");
Description_new = tranwrd(Description_new," my ", " ");
run;
Upvotes: 0
Views: 960
Reputation: 51611
You could use a format to convert each word in the original variable.
data have;
infile datalines dsd truncover;
input ID Description :$50. Col3 $ Col4 Col5 Col6;
datalines;
1,bla bla my mybla,C1,0,100,0
2,got me tear,C1,0,0,0
3,free text i ,C1,10,100,0
4,house roof tree!?,C1,10,100,0
5,house mugg muggle,C1,10,0,0
6,sky** computer mug mug mugs!,C3,0,20,1
;
proc format ;
value $fix (max=200)
"mug", "mugg", "mugs" = "cup"
"i", "me", "my" = " "
;
run;
data want;
set have;
fixed=description;
fixed=' ';
do index=1 to countw(description,' ');
fixed=catx(' ',fixed,put(scan(description,index,' '),$fix200.));
end;
run;
Upvotes: 2