Reputation: 1031
messy is a SAS character variable containing a list of papers an author cited in their own work.
Here is one observation of messy:
(label:1;name:Azad, Meghan B;pubyear:2008;volume:4;issue:2;pagenum:195;refwork:Autophagy;collkey:2008051953)(label:2;name:Bai, Jing;pubyear:2012;volume:39;issue:3;pagenum:2697;refwork:Mol Biol Rep;collkey:2012197491)
This record includes 2 references: one that begins at "(label:1;" and another that begins at "(label:2;".
I need to create character variables that return part of the content after "name:" for each reference. For this observation, that would look like this:
clean1 clean2
AZAD.MEGHAN BAI.JING
I attempt to do this with the scan() function in a data step as follows:
data ex2;
length lastname1-lastname10
firstname1-firstname10
clean1-clean10 $100; /*initializes empty variables for storing up to 10 names*/
set ex;
array lastname {*} lastname1-lastname10;
array firstname {*} firstname1-firstname10;
array clean {*} clean1-clean10;
do i=1 to count(messy, "label:"); /*loop through messy as many times as there are names*/
lastname{i} = scan(messy, 1, "name:"); /*pick up first word after name*/
firstname{i} = scan(messy, 2, "name:"); /*pick up second word after name*/
clean{i} = cats(upcase(lastname{i}), ".", upcase(firstname{i}));
end;
run;
There are (at least) two issues here:
1. On every iteration of the loop, scan() assigns the same value (the content after the first occurrence of "name:") to the variables in the lastname and firstname arrays.
2. I don't understand how scan() itself works! I thought the third argument allowed me to specify a delimiter of my choosing, but scan(messy, 1, "name:"); returns "(l" instead of "AZAD" as I expected.
Specific Ask:
How can I pick up all names in the messy variable and store them separately as clean1, clean2, etc.?
Upvotes: 0
Views: 79
Reputation: 1427
I would use the PRX* functions for this kind of task:
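data ex2;
length clean1-clean10 $100;
set ex;
array clean {*} clean1-clean10;
retain ExpressionID;
/* compile the pattern once; group 1 captures the text between "name:" and the next ";" or ")" */
if _N_=1 then ExpressionID=prxparse('/[(;]name:([^;)]+)[;)]/');
start = 1;
stop = length(messy);
/* find the first match, then keep searching from where the previous match ended */
call prxnext(ExpressionID, start, stop, messy, position, length);
i=0;
do while (position > 0 and i < dim(clean)); /* stop once all 10 variables are filled */
i+1;
clean{i} = prxposn(ExpressionID, 1, messy);
call prxnext(ExpressionID, start, stop, messy, position, length);
end;
run;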
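Note that the step above stores the whole name field (for example "Azad, Meghan B"), not the AZAD.MEGHAN form shown in the question. One possible way to get there, as a rough sketch only: post-process each captured name with scan(). The third argument of scan() is a list of single-character delimiters, not a delimiter string, which is why scan(messy, 1, "name:") splits on any of n, a, m, e or : and returns "(l". The sketch assumes every name is written as "Last, First ..." with a comma after the last name; the names ex3, re, fullname, last, first and len are made up for illustration.

```sas
data ex3;
  length clean1-clean10 fullname last first $100;
  set ex;
  array clean {*} clean1-clean10;
  retain re;
  /* compile the pattern once and reuse it for every observation */
  if _N_ = 1 then re = prxparse('/[(;]name:([^;)]+)[;)]/');
  start = 1;
  stop = length(messy);
  i = 0;
  call prxnext(re, start, stop, messy, position, len);
  do while (position > 0 and i < dim(clean));
    i + 1;
    fullname = prxposn(re, 1, messy);            /* e.g. "Azad, Meghan B" */
    last  = scan(fullname, 1, ',');              /* text before the comma */
    first = scan(scan(fullname, 2, ','), 1, ' ');/* first word after the comma */
    clean{i} = cats(upcase(last), '.', upcase(first));
    call prxnext(re, start, stop, messy, position, len);
  end;
  /* assumption: the helper variables are not wanted in the output */
  drop re start stop position len i fullname last first;
run;
```

For the observation in the question this should give clean1 = AZAD.MEGHAN and clean2 = BAI.JING. Names that do not follow the assumed "Last, First" layout would need different handling.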
Upvotes: 1