Reputation: 494
I have a list of multiple-length degree programs that include the degree type (e.g., PhD) and I want to delete the degree type and keep just the program name. e.g.:
Master of Science in Building Performance and Diagnostics
Master of Science in Computational Design
Master of Science in Sustainable Design
Master of Urban Design
PhD in Architecture
I am trying to use scan to split the string on "in " and extract all text that follows, but I don't understand the result I'm getting. When I use -1 (start from the right) as the starting point I get:
data want;
format new_prog old_prog $200.;
set have (rename = (program = old_prog));
if count(old_prog, " in ") ge 1 then new_prog = scan(old_prog, -1, "in ");
run;
new_prog old_prog
tecture Master of Science in Architecture
g Master of Science in Sustainable Design
cs Master of Science in Building Performance and Diagnostics
t Master of Science in Architecture-Engineering and Construction Management
I don't think this would work anyway since I want the entire string after "in " and not just the next word, but even if I use scan(old_prog, 2, "in") I expect this to give me the next word but it seems to be giving me random stuff, e.g.:
program old_prog
Bu PhD in Building Performance and Diagnostics
of Master of Science in Architecture-Engineering and Construction Management
Computat PhD in Computational Design
of Master of Science in Sustainable Design
Upvotes: 0
Views: 7730
Reputation: 1449
You have a number of valid options suggested by others. Could I suggest a REGEX way of getting what you want?
I have noticed three patterns in your sample data:
When dealing with pattern in text, REGEX is very useful because you can define text patter that you are looking for and extract text when pattern is true.
See the comments in code for more information:
/* Dropping pattern ids because they are not useful in data */
data have (drop=pattern_in pattern_of);
/* Reading in the raw data from datalines */
input @1 old_prog $60.;
/* Compiling first sample based on "in " pattern. */
pattern_in = prxparse('/in ([\w\s]*)/');
/* Compiling first sample based on "of " pattern. */
pattern_of = prxparse('/of ([\w\s]*)/');
/*If the string satisfied the patter with "in " */
if prxmatch(pattern_in,old_prog) then
/* Then extract capture buffer after "in " pattern */
new_prog=prxposn(pattern_in,1,old_prog);
/*If the string satisfied the patter with "of " after it didn't find patter "in "*/
else if prxmatch(pattern_of,old_prog) then
/* Then extract capture buffer after "of " pattern */
new_prog=prxposn(pattern_of,1,old_prog);
datalines;
Master of Scinence in Building Performance and Diagnostics
Master of Science in Computational Design
Master of Science in Sustainable Design
Master of Urban Design
PhD in Architecture
;
PROC PRINT DATA=have;
run;
Upvotes: 0
Reputation: 21
data have;
input @1 old_prog $60.;
if find(old_prog, ' in ') then new_prog = substr(old_prog, 1, find(old_prog, ' in '));
else new_prog= old_prog;
datalines;
Master of Scinence in Building Performance and Diagnostics
Master of Science in Computational Design
Master of Science in Sustainable Design
Master of Urban Design
PhD in Architecture
;
run;
proc print data=have;
run;
Obs old_prog new_prog
1 Master of Scinence in Building Performance and Diagnostics Master of Scinence
2 Master of Science in Computational Design Master of Science
3 Master of Science in Sustainable Design Master of Science
4 Master of Urban Design Master of Urban Design
5 PhD in Architecture PhD
Upvotes: 2
Reputation: 107767
Consider a data step and proc sql solution using the substr and index functions:
data want;
set have;
if count(old_prog, " in ") ge 1
then new_prog = substr(old_prog, index(old_prog, "in")+3);
run;
proc sql;
create table want as
select *,
case when index(old_prog, "in") > 0
then substr(old_prog, index(old_prog, "in")+3)
else old_prog
end as new_prog
from want;
run;
Upvotes: 0
Reputation: 428
Here is how to do it using substr and index.
data want;
format new_prog old_prog $200.;
infile datalines dsd missover;
input old_prog :$200.;
if count(old_prog, " in ") ge 1 then new_prog = substr(old_prog,index(old_prog,"in") + 3);
datalines;
Master of Science in Building Performance and Diagnostics
Master of Science in Computational Design
Master of Science in Sustainable Design
Master of Urban Design
PhD in Architecture
;
run;
Index will find the position of "in " in the string and pass it to substr to start cutting the variable from this position + 3 to the end of the string.
Upvotes: 0