Reputation: 5913
I have the following function defined via PROC FCMP. The point of the code should be pretty obvious and relatively straightforward: I'm returning the value of an attribute from a line of XHTML. Here's the code:
proc fcmp outlib=library.funcs.crawl;
function getAttr(htmline $, Attribute $) $;
/*-- Find the position of the match --*/
Pos = index( htmline , strip( Attribute )||"=" );
/*-- Now do something about it --*/
if pos > 0 then do;
Value = scan( substr( htmline, Pos + length( Attribute ) + 2), 1, '"');
end;
else Value = "";
return( Value);
endsub;
run;
No matter what I do with a length or attrib statement to try to explicitly declare the data type returned, it ALWAYS returns a maximum of 33 bytes of the requested string, regardless of how long the actual return value is. This happens no matter which attribute I am searching for. The same code, hard-coded into a DATA step, returns the correct results, so this is related to PROC FCMP.
Here is the DATA step I'm using to test it (where PageSource.html is any HTML file that has XHTML-compliant, fully quoted attributes):
data TEST;
length href $200;
infile "F:\PageSource.html";
input;
htmline = _INFILE_;
href = getAttr( htmline, "href");
x = length(href);
run;
UPDATE: This seems to work properly after upgrading to SAS 9.2, Release 2.
Upvotes: 3
Views: 1351
Reputation: 21
It seems that character variables in PROC FCMP that have no declared length (what I call "uninitialized" variables below) get a default length of 33 bytes. Consider the following demonstration code:
OPTIONS INSERT = (CMPLIB = WORK.FCMP);
PROC FCMP
OUTLIB = WORK.FCMP.FOO
;
FUNCTION FOO(
BAR $
);
* Assign the value of BAR to the uninitialised variable BAZ;
BAZ = BAR;
* Diagnostics;
PUT 'BAR IS ' BAR;
PUT 'BAZ IS ' BAZ;
* Return error code;
IF
LENGTH(BAZ) NE LENGTH(BAR)
THEN
RETURN(0)
; ELSE
RETURN(1)
;
ENDSUB;
RUN;
DATA _NULL_;
X = 'shortstring';
Y = 'exactly 33 characters long string';
Z = 'this string is somewhat longer than 33 characters';
ARRAY STRINGS{*} _CHARACTER_;
ARRAY RC{3} 8 _TEMPORARY_;
DO I = 1 TO DIM(STRINGS);
RC[I] = FOO(STRINGS[I]);
END;
RUN;
With my site's installation (Base SAS 9.4 M2), this prints the following lines to the log:
BAR IS shortstring
BAZ IS shortstring
BAR IS exactly 33 characters long string
BAZ IS exactly 33 characters long string
BAR IS this string is somewhat longer than 33 characters
BAZ IS this string is somewhat longer th
This is likely related to the fact that PROC FCMP, like the DATA step, cannot allocate variable lengths dynamically at runtime. It is a little confusing, though, because PROC FCMP does appear to size parameters dynamically. I'm assuming there is a separate "initialization" phase for PROC FCMP subroutines, during which the lengths of the values passed as arguments are determined and the parameter variables that hold them are initialized to the required lengths. The lengths needed by variables defined only within the body of the subroutine, however, can be discovered only at runtime, by which point memory has already been allocated. So before runtime (whether at compile time or in my hypothetical "initialization" phase), each such variable is allocated the length given by an explicit LENGTH statement if one is present, and otherwise falls back to a default of 33 bytes.
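If that reading is right, declaring the local variable directly should avoid the truncation. The following is a minimal sketch under that assumption, not something verified on every release (the question above reports that LENGTH did not help on an older one). FOO_DECLARED and the 200-byte width are hypothetical choices made only for illustration:

```sas
OPTIONS CMPLIB = WORK.FCMP;

PROC FCMP OUTLIB = WORK.FCMP.FOO;
  /* FOO_DECLARED is a hypothetical name used only for this sketch */
  FUNCTION FOO_DECLARED(BAR $);
    /* Assumption: 200 bytes is wide enough for every value passed in */
    LENGTH BAZ $ 200;
    BAZ = BAR;
    PUT 'BAZ IS ' BAZ;
    /* Returns 1 when nothing was truncated, 0 otherwise */
    RETURN(LENGTH(BAZ) = LENGTH(BAR));
  ENDSUB;
RUN;

DATA _NULL_;
  RC = FOO_DECLARED('this string is somewhat longer than 33 characters');
  PUT RC=;
RUN;
```

Comparing RC and the BAZ line in the log against the earlier run shows whether the explicit declaration takes effect on a given installation.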
Now what's really interesting is that PROC FCMP is as smart as it can be about this, within the strict separation of initialization and runtime stages. If, in the body of the subroutine, a variable A has an explicitly defined LENGTH, and another uninitialized variable B is assigned a function of A, then B is set to the same length as A. Consider this modification of the above function, in which the value of BAR is not assigned directly to BAZ but rather via a third variable, QUX, which has an explicitly defined LENGTH of 50 bytes:
OPTIONS INSERT = (CMPLIB = WORK.FCMP);
PROC FCMP
OUTLIB = WORK.FCMP.FOO
;
FUNCTION FOO(
BAR $
);
LENGTH QUX $ 50;
QUX = BAR;
* Assign the value of BAR to the uninitialised variable BAZ;
BAZ = QUX;
* Diagnostics;
PUT 'BAR IS ' BAR;
PUT 'BAZ IS ' BAZ;
* Return error code;
IF
LENGTH(BAZ) NE LENGTH(BAR)
THEN
RETURN(0)
; ELSE
RETURN(1)
;
ENDSUB;
RUN;
DATA _NULL_;
X = 'shortstring';
Y = 'exactly 33 characters long string';
Z = 'this string is somewhat longer than 33 characters';
ARRAY STRINGS{*} _CHARACTER_;
ARRAY RC{3} 8 _TEMPORARY_;
DO I = 1 TO DIM(STRINGS);
RC[I] = FOO(STRINGS[I]);
END;
RUN;
The log shows:
BAR IS shortstring
BAZ IS shortstring
BAR IS exactly 33 characters long string
BAZ IS exactly 33 characters long string
BAR IS this string is somewhat longer than 33 characters
BAZ IS this string is somewhat longer than 33 characters
It's likely that this "helpful" behavior is the cause of confusion and differences in the previous answers. I wonder if this behavior is documented?
I'll leave it as an exercise to the reader to investigate exactly how smart SAS tries to get about this. For example, if an uninitialized variable gets assigned the concatenated values of two other variables with explicitly assigned lengths, is its length set to the sum of those of the other two?
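One way to start on that exercise is sketched below. It makes no claim about the result; it only sets up the experiment. CATLEN, S1, S2 and the 30-byte lengths are hypothetical choices for illustration:

```sas
OPTIONS CMPLIB = WORK.FCMP;

PROC FCMP OUTLIB = WORK.FCMP.FOO;
  /* CATLEN is a hypothetical name used only for this experiment */
  FUNCTION CATLEN(S1 $, S2 $);
    LENGTH A $ 30 B $ 30;
    A = S1;
    B = S2;
    /* C has no LENGTH statement, so its length is whatever PROC FCMP infers */
    C = A || B;
    PUT 'C IS ' C;
    /* Number of bytes that survived in C */
    RETURN(LENGTH(C));
  ENDSUB;
RUN;

DATA _NULL_;
  /* Two 30-character strings with no blanks, so truncation is easy to spot */
  N = CATLEN('012345678901234567890123456789', 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcd');
  PUT N=;
RUN;
```

A value of N=60 in the log would suggest the inferred length is the sum of the two declared lengths, while N=33 would point to the default being applied instead.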
Upvotes: 2
Reputation: 4475
I think the problem (though I don't know why) is in the scan function - it seems to be truncating input from substr(). If you pull the substr function out of scan(), assign the result of the substr function to a new variable that you then pass to scan, it seems to work.
Here is what I ran:
proc fcmp outlib=work.funcs.crawl;
function getAttr(htmline $, Attribute $) $;
length y $200;
/*-- Find the position of the match --*/
Pos = index( htmline , strip( Attribute )||"=" );
/*-- Now do something about it --*/
if pos > 0 then do;
y=substr( htmline, Pos + length( Attribute ) + 2);
Value = scan( y, 1, '"');
end;
else Value = "";
return( Value);
endsub;
run;
options cmplib=work.funcs;
data TEST;
length href $200;
infile "PageSource.html";
input;
htmline = _INFILE_;
href = getAttr( htmline, "href");
x = length(href);
run;
Upvotes: 3
Reputation: 2307
In this case, an input pointer control should be enough. Hope this helps.
/* create a test input file */
data _null_;
file "f:\pageSource.html";
input;
put _infile_;
cards4;
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet href="w3.org/StyleSheets/TR/W3C-REC.css"; type="text/css"?>
;;;;
run;
/* extract the href attribute value, if any. */
/* assuming that the value and the attribute name occurs in one line. */
/* and max length is 200 chars. */
data one;
infile "f:\pageSource.html" missover;
input @("href=") href :$200.;
href = scan(href, 1, '"'); /* unquote */
run;
/* check */
proc print data=one;
run;
/* on lst
Obs href
1
2 w3.org/StyleSheets/TR/W3C-REC.css
*/
Upvotes: 2
Reputation: 5913
I ended up backing out of using FCMP-defined DATA step functions. I don't think they're ready for prime time. Not only could I not solve the 33-byte return issue, but they also started regularly crashing SAS.
So back to the good old (decades old) technology of macros. This works:
/*********************************/
/*= Macro to extract Attribute =*/
/*= from XHTML string =*/
/*********************************/
%macro getAttr( htmline, Attribute, NewVar );
if index( &htmline , strip( &Attribute )||"=" ) > 0 then do;
&NewVar = scan( substr( &htmline, index( &htmline , strip( &Attribute )||"=" ) + length( &Attribute ) + 2), 1, '"' );
end;
%mend;
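For completeness, here is a sketch of how the macro might be called, reusing the test DATA step from the question. It assumes the macro above has already been compiled in the session and that the file path from the question exists:

```sas
data TEST;
  /* Declare the target variable before the macro assigns to it */
  length href $200;
  infile "F:\PageSource.html";
  input;
  htmline = _INFILE_;
  /* Expands inline to the IF ... THEN DO block defined in the macro */
  %getAttr( htmline, "href", href );
  x = length(href);
run;
```

Because the macro expands to ordinary DATA step statements, the length of the result is governed by the LENGTH statement in the calling step rather than by anything PROC FCMP infers.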
Upvotes: 0