Reputation: 4435
I'm using PHP and I need to scrape some information from cURL responses from a site. I am simulating both an AJAX request by a browser and a normal (full-page) request by a browser, but the AJAX response differs slightly from the full-page response in this section of the HTML.
The AJAX response is:
<div id="accountProfile"><h2>THIS IS THE BIT I WANT</h2><dl id="accountProfileData">
However, the normal response is:
<div id="accountProfile"><html xmlns="http://www.w3.org/1999/xhtml"><h2>THIS IS THE BIT I WANT</h2><dl id="accountProfileData">
That is, the AJAX response is missing the tag <html xmlns="http://www.w3.org/1999/xhtml">. I need to get the text between the h2 tags. Obviously I can't just scrape the page for <h2>THIS IS THE BIT I WANT</h2><dl id="accountProfileData">, since these tags may occur in other places and not contain the information I want.
I can match either pattern individually, but I would like to match both with a single regex. Here is my solution for matching the AJAX response:
<?php
$pattern = '/\<div id="accountProfile"\>\<h2\>(.+?)\<\/h2\>\<dl id="accountProfileData"\>/';
preg_match($pattern, $haystack, $matches);
print_r($matches);
?>
Can someone show me how I should alter the pattern to optionally match the <html xmlns="http://www.w3.org/1999/xhtml"> tag as well? If it helps to simplify the haystack for the purposes of brevity, that's fine.
Upvotes: 4
Views: 379
Reputation: 151
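I haven't tested it, but you can try making the extra tag an optional group. The slashes inside the URL have to be escaped because / is the pattern delimiter, and (?:...)? keeps the group non-capturing, so the text you want stays in $matches[1]:
$pattern = '/\<div id="accountProfile"\>(?:\<html xmlns="http:\/\/www\.w3\.org\/1999\/xhtml"\>)?\<h2\>(.+?)\<\/h2\>\<dl id="accountProfileData"\>/';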
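For a quick check, here is a minimal sketch that runs one pattern against both haystacks from the question. It assumes the markup appears exactly as quoted there (same attribute order, no whitespace between tags). The function name extractProfileHeading is hypothetical, and # is used as the delimiter purely so the slashes need no escaping.

```php
<?php
// Hypothetical helper: returns the text inside the <h2> that directly follows
// <div id="accountProfile">, whether or not the stray <html ...> tag is present.
function extractProfileHeading(string $haystack): ?string
{
    // '#' is the delimiter, so the '/' characters in the URL and in </h2> stay unescaped.
    // (?: ... )? makes the <html ...> tag optional without adding a capture group.
    $pattern = '#<div id="accountProfile">'
             . '(?:<html xmlns="http://www\.w3\.org/1999/xhtml">)?'
             . '<h2>(.+?)</h2><dl id="accountProfileData">#';

    // preg_match returns 1 on a match, 0 on no match, false on error.
    return preg_match($pattern, $haystack, $matches) === 1 ? $matches[1] : null;
}

// The two sample responses quoted in the question.
$ajax = '<div id="accountProfile"><h2>THIS IS THE BIT I WANT</h2><dl id="accountProfileData">';
$full = '<div id="accountProfile"><html xmlns="http://www.w3.org/1999/xhtml"><h2>THIS IS THE BIT I WANT</h2><dl id="accountProfileData">';

echo extractProfileHeading($ajax), "\n";
echo extractProfileHeading($full), "\n";
```

Both calls should give back the same heading text, and a haystack without the accountProfile div yields null. If the real pages vary in whitespace or attribute order, the pattern would need loosening; that is outside what the question shows.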
Upvotes: 2