samach

Reputation: 3394

Get an element using PHP's DOMDocument

I have the following HTML, and I am using PHP's DOMDocument class to get the element with id 'nextPageBtn' that follows the script tag. The problem is that my query does not return anything (as if there were no element with the specified id). Here is the HTML I am parsing:

<body>
    <div style='float:left'><img src='../../../../includes/ph1.jpg'></div>

    <label style='width: 476px; height: 40px; position: absolute;top:100px; left: 40px; z-index: 2; background-color: rgb(255, 255, 255);; background-color: transparent' >
    <font size="4">1a. Nice to meet you!</font>
    </label>
    <img src='ENG_L1_C1_P0_1.jpg' style='width: 700px; height: 540px; position: absolute;top:140px; left: 40px; z-index: 1;' />

    <script type='text/javascript'> 


    swfobject.registerObject('FlashID');
    </script>

    <input type="image" id="nextPageBtn" src="../../../../includes/ph4.gif" style="position: absolute; top: 40px; left: 795px; ">

    </body>

And here is the PHP code that parses it:

$doc = new DOMDocument(); // creation of $doc was not shown in the original snippet; assumed
$doc->loadHTMLFile($path);

$doc->encoding = 'UTF-8';
$x = new DOMXPath($doc);
$nextPage = $x->query("//*[@id='nextPageBtn']")->item(0);
if ($nextPage) {
    echo 'found it..';
}

I think the line 'swfobject.registerObject('FlashID')' is generating some kind of error that prevents the element from being found?

Upvotes: 1

Views: 341

Answers (1)

hakre

Reputation: 197767

As written in the comment, your code just works flawlessly. Demo: http://codepad.viper-7.com/RUNGOd
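In case the demo link is unavailable, here is a minimal, self-contained sketch of the same check. It assumes a shortened inline copy of the markup from the question and uses loadHTML() on a string instead of loadHTMLFile() on `$path`; the variable names are only illustrative.

```php
<?php
// Shortened inline copy of the question's markup (assumption: the real
// file contains the same script block and input element).
$html = <<<'HTML'
<body>
    <script type='text/javascript'>
    swfobject.registerObject('FlashID');
    </script>

    <input type="image" id="nextPageBtn" src="../../../../includes/ph4.gif">
</body>
HTML;

$doc = new DOMDocument();
$doc->loadHTML($html); // same HTML parser as loadHTMLFile(), fed from a string

$xpath = new DOMXPath($doc);
$nextPage = $xpath->query("//*[@id='nextPageBtn']")->item(0);

if ($nextPage) {
    // The script block before the input does not hide the element.
    echo 'found it: ', $nextPage->getAttribute('src'), "\n";
} else {
    echo "not found\n";
}
```

If this finds the element on your machine but your real script does not, the difference is most likely in the file that `$path` points to rather than in the XPath query.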

What you consider the source of the problem:

I think the line 'swfobject.registerObject('FlashID')' is generating some kind of error that prevents the element from being found?

That can hardly be the cause, as DOMDocument::loadHTMLFile should deal with all tags (otherwise you would have received errors or warnings while loading the document). Once loading is done, DOMDocument holds normalized data, so such issues do not arise (unless there is a bug in libxml, the underlying library, but that is unlikely for something this general).

So what are the options here? Probably the HTML that gets loaded is not the HTML you think it is. That can happen if loading the HTML fails in your case. Check for errors while loading:

error_reporting(~0); ini_set('display_errors', 1);
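Besides raising the reporting level, the parser messages can also be collected and printed explicitly. The following is only a sketch: it uses libxml_use_internal_errors() and libxml_get_errors(), and the sloppy inline markup is a made-up stand-in for your file. Which messages appear, if any, depends on the libxml version.

```php
<?php
// Collect libxml parser messages instead of emitting them as PHP warnings.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
// Hypothetical, deliberately sloppy markup so the parser may have something to report.
$loaded = $doc->loadHTML('<body><nosuchtag>text</nosuchtag><p id="a">x</p></body>');

if ($loaded === false) {
    echo "loading failed\n";
}

// Each entry is a LibXMLError object with line, column, code and message.
$errors = libxml_get_errors();
foreach ($errors as $error) {
    printf("line %d: %s\n", $error->line, trim($error->message));
}
libxml_clear_errors();
```

With your own file, replace the loadHTML() call with `$doc->loadHTMLFile($path)` and look at what gets listed.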

Also verify that the HTML is what you expect after loading:

$doc->loadHTMLFile($path);
echo $doc->saveHTML();

which will output the "source".

Also check your LIBXML version:

printf("LIBXML version: %s\n", LIBXML_DOTTED_VERSION);

libxml is the library that PHP's DOMDocument is built on. Depending on the version, there can be bugs, and not every feature works. For example, the getElementById function doesn't work with loadHTMLFile/loadHTML in version 2.6.26, but it does in version 2.7.7 (the XPath expression you're using is not affected in either of these two versions).
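To see how your installation behaves, the two lookups can be compared side by side. This is a sketch with a one-line inline document (an assumption, not your file); the getElementById() result may differ between libxml versions, as described above.

```php
<?php
$doc = new DOMDocument();
$doc->loadHTML('<body><input type="image" id="nextPageBtn" src="ph4.gif"></body>');

printf("LIBXML version: %s\n", LIBXML_DOTTED_VERSION);

// Lookup 1: relies on libxml registering id attributes while parsing HTML.
$byId = $doc->getElementById('nextPageBtn');

// Lookup 2: plain attribute comparison, independent of ID registration.
$xpath = new DOMXPath($doc);
$byXPath = $xpath->query("//*[@id='nextPageBtn']")->item(0);

printf("getElementById: %s\n", $byId !== null ? 'found' : 'not found');
printf("XPath query:    %s\n", $byXPath !== null ? 'found' : 'not found');
```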

If you're running into an encoding issue here (the source file has some other encoding than expected), it's harder to tell with the information you've provided. Internally DOMDocument's default encoding is UTF-8 in PHP, so setting:

 $doc->encoding='UTF-8';

after you've loaded the file looks superfluous to me. Maybe you should just remove it, so that there is less code to search through for the source of the error (as I did in the demo).

Upvotes: 1
