Sarah Hailey
Sarah Hailey

Reputation: 494

SAS how to extract multiple words from a string

I have a list of multiple-length degree programs that include the degree type (e.g., PhD) and I want to delete the degree type and keep just the program name. e.g.:

Master of Science in Building Performance and Diagnostics
Master of Science in Computational Design  
Master of Science in Sustainable Design 
Master of Urban Design 
PhD in Architecture 

I am trying to use scan to split the string on "in " and extract all text that follows, but I don't understand the result I'm getting. When I use -1 (start from the right) as the starting point I get:

data want; 
    format new_prog old_prog $200.; 
    set have (rename = (program = old_prog)); 
    if count(old_prog, " in ") ge 1 then new_prog = scan(old_prog, -1, "in "); 
run; 


new_prog  old_prog
tecture   Master of Science in Architecture 
g         Master of Science in Sustainable Design 
cs        Master of Science in Building Performance and Diagnostics 
t         Master of Science in Architecture-Engineering and Construction Management 

I don't think this would work anyway since I want the entire string after "in " and not just the next word, but even if I use scan(old_prog, 2, "in") I expect this to give me the next word but it seems to be giving me random stuff, e.g.:

program  old_prog 
Bu       PhD in Building Performance and Diagnostics 
of       Master of Science in Architecture-Engineering and Construction Management 
Computat PhD in Computational Design 
of       Master of Science in Sustainable Design 

Upvotes: 0

Views: 7730

Answers (4)

Vasilij Nevlev
Vasilij Nevlev

Reputation: 1449

You have a number of valid options suggested by others. Could I suggest a REGEX way of getting what you want?

I have noticed three patterns in your sample data:

  1. Typical separator is "in " that you attempt to use in your code sample
  2. When typical separator is not used, then another separator "of " is being used.
  3. The degree type can be spelled differently (Master of Scinence,Master of Science, PhD).

When dealing with pattern in text, REGEX is very useful because you can define text patter that you are looking for and extract text when pattern is true.

See the comments in code for more information:

/* Dropping pattern ids because they are not useful in data */
data have (drop=pattern_in pattern_of);
    /* Reading in the raw data from datalines */
    input @1 old_prog $60.;

    /* Compiling first sample based on "in " pattern. */
    pattern_in = prxparse('/in ([\w\s]*)/');

    /* Compiling first sample based on "of " pattern. */
    pattern_of = prxparse('/of ([\w\s]*)/');

    /*If the string satisfied the patter with "in " */
    if prxmatch(pattern_in,old_prog) then 
    /* Then extract capture buffer after "in " pattern */
    new_prog=prxposn(pattern_in,1,old_prog);

    /*If the string satisfied the patter with "of " after it didn't find patter "in "*/
    else if prxmatch(pattern_of,old_prog) then 
    /* Then extract capture buffer after "of " pattern */
    new_prog=prxposn(pattern_of,1,old_prog);
    datalines;
Master of Scinence in Building Performance and Diagnostics
Master of Science in Computational Design 
Master of Science in Sustainable Design 
Master of Urban Design 
PhD in Architecture 
;

PROC PRINT DATA=have;
run;

Result: enter image description here

Upvotes: 0

Parul
Parul

Reputation: 21

data have;
input @1 old_prog $60.;
if find(old_prog, ' in ') then new_prog = substr(old_prog, 1, find(old_prog, ' in '));
else new_prog= old_prog;
datalines;
Master of Scinence in Building Performance and Diagnostics
Master of Science in Computational Design
Master of Science in Sustainable Design
Master of Urban Design
PhD in Architecture
;
run;
proc print data=have;
run;

Obs old_prog new_prog
1 Master of Scinence in Building Performance and Diagnostics Master of Scinence
2 Master of Science in Computational Design Master of Science
3 Master of Science in Sustainable Design Master of Science
4 Master of Urban Design Master of Urban Design
5 PhD in Architecture PhD

Upvotes: 2

Parfait
Parfait

Reputation: 107767

Consider a data step and proc sql solution using the substr and index functions:

data want;
    set have;
    if count(old_prog, " in ") ge 1 
       then new_prog = substr(old_prog, index(old_prog, "in")+3);
run;


proc sql;
    create table want as
    select *, 
    case when index(old_prog, "in") > 0 
         then substr(old_prog, index(old_prog, "in")+3)
         else old_prog
    end as new_prog
    from want;
run;

Upvotes: 0

RamB
RamB

Reputation: 428

Here is how to do it using substr and index.

data want;
format new_prog old_prog $200.;
infile datalines dsd missover;
input old_prog :$200.;

if count(old_prog, " in ") ge 1 then new_prog = substr(old_prog,index(old_prog,"in") + 3); 

datalines;
Master of Science in Building Performance and Diagnostics
Master of Science in Computational Design  
Master of Science in Sustainable Design 
Master of Urban Design 
PhD in Architecture 
;
run;

Index will find the position of "in " in the string and pass it to substr to start cutting the variable from this position + 3 to the end of the string.

Upvotes: 0

Related Questions