Reputation: 9564
I've got a problem with regexp function, preg_replace(), in PHP. I want to get viewstate from html's input, but it doesn't work properly.
This code:
$viewstate = preg_replace('/^(.*)(<input\s+id="__VIEWSTATE"\s+type="hidden"\s+value=")(.*[^"])("\s+name="__VIEWSTATE">)(.*)$/u','^\${3}$',$html);
Returns this:
%0D%0A%0D%0A%3C%21DOCTYPE+html+PUBLIC+%22-%2F%2FW3C%2F%2FDTD+XHTML+1.0+Transitional%2F%2FEN%22+%22http%3A%2F%2Fwww.w3.org%2FTR%2Fxhtml1%2FDTD%2Fxhtml1-transitional.dtd%22%3E%0D%0A%0D%0A%3Chtml+xmlns%3D%22http%3A%2F%2Fwww.w3.org%2F1999%2Fxhtml%22+%3E%0D%0A%3Chead%3E%3Ctitle%3E%0D%0A%09Strava.cz%0D%0A%3C%2Ftitle%3E%3Clink+rel%3D%22shortcut+icon%22+href%3D%22..%2FGrafika%2Ffavicon.ico%22+type%3D%22image%2Fx-icon%22+%2F%3E%3Clink+rel%3D%22stylesheet%22+type%3D%22text%2Fcss%22+media%3D%22screen%22+href%3D%22..%2FStyly%2FZaklad.css%22+%2F%3E%0D%0A++++%3Cstyle+type%3D%22text%2Fcss%22%3E%0D%0A++++++++.style1%0D%0A++++++++%7B%0D%0A++++++++++++width%3A+47px%3B%0D%0A++++++++%7D%0D%0A++++++++.style2%0D%0A++++++++%7B%0D%0A++++++++++++width%3A+64px%3B%0D%0A++++++++%7D%0D%0A++++%3C%2Fstyle%3E%0D%0A%0D%0A%3Cscript+type%3D%22text%2Fjavascript%22%3E%0D%0A%0D%0A++var+_gaq+%3D+_gaq+%7C%7C+%5B%5D%3B%0D%0A++_gaq.push%28%5B
EDIT: Sorry, I left this question for a long time. Finally I used DOMDocument.
Upvotes: 0
Views: 474
Reputation: 9564
The main mistake was the use of funciton preg_replace
, witch returns the subject - neither the matched pattern nor the replacement. Thank you for your ideas and for the recommendation of DOMDocument. m93a
http://www.php.net/manual/en/function.preg-replace.php#refsect1-function.preg-replace-returnvalues
Upvotes: 0
Reputation: 7866
To be sure i'd split this match into two phases:
Because you cannot be certain what the attributes order in the element will be.
if(preg_match('/<input[^>]+name="__VIEWSTATE"[^>]*>/i', $input, $match))
$value = preg_replace('/.*value="([^"]*)".*/i', '$1', $match[0]);
And, of course, always consider DOM
and DOMXpath
over regex for parsing html/xml.
Upvotes: 2
Reputation: 1094
You should only capture when you're planning on using the data. So most () are obsolete in that regexp pattern. Not a cause for failure but I thought I'd mention it.
Instead of using [^"] to mark that you don't want that character you could use the non-greedy modifier - ?
. This makes sure the pattern is matching as little as it can. Since you have name="__VIEWSTATE"
following the value this should be safe.
Let's put this in practice and simplify the pattern some. This works as you want:
'/.*<input\s+id="__VIEWSTATE"\s+type="hidden"\s+value="(.+?)"\s+name="__VIEWSTATE">.*/'
I would strongly recommend checking out an alternative to regexp for DOM operations. This makes certain your code works also if the attributes changes order. Plus it's so much nicer to work with.
Upvotes: 1