user16202411

How can I scrape invalid HTML using PHP Simple HTML DOM?

I'm trying to scrape a webpage using PHP Simple HTML DOM Parser.

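$html = '<div class="namepageheader">
            <div class="u">Name: <a href="someurl">Noor Shaad</a>
            <div class="u">Age: </div>
        </div> ';
// Parse the string first; find() is a method of the parsed object, not of a string.
$dom = str_get_html($html);
$name = $dom->find('div[class="u"]', 0)->innertext;
$age = $dom->find('div[class="u"]', 1)->innertext;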

I tried to get the text from each class="u" element, but it didn't work because the first <div class="u"> is missing its closing </div> tag. Can anyone help me with this?

Upvotes: 0

Views: 162

Answers (1)

sama latifi

Reputation: 86

You can look for a tag that sits right where the missing closing tag should have been, and repair the HTML with a string replacement before parsing it. For example, you can replace each </a> tag with </a></div>.

$html = str_replace('</a>', '</a></div>', $html);

Or, if the page contains other </a> tags that should not be followed by </div>, use a more specific pattern and replace </a><div class="u"> with </a></div><div class="u">:

$html = str_replace('</a><div class="u">', '</a></div><div class="u">', $html);

There may be another problem: if there is whitespace between the tags (such as the newline and indentation in your sample), the search string does not match and nothing is replaced. To solve this, first remove the whitespace between the tags and then do the replacement.

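$html = '<div class="namepageheader">
            <div class="u">Name: <a href="someurl">Noor Shaad</a>
            <div class="u">Age: </div>
        </div> ';
// Remove whitespace between adjacent tags so the search string below can match.
$html = preg_replace('~>\s+<~', '><', $html);
// Insert the missing </div>; str_replace returns the new string, so assign it.
$html = str_replace('</a><div class="u">', '</a></div><div class="u">', $html);
// Parse the repaired string (str_get_html comes from simple_html_dom.php).
$dom = str_get_html($html);
$name = $dom->find('div[class="u"]', 0)->innertext;
$age = $dom->find('div[class="u"]', 1)->innertext;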
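If you would rather not strip whitespace from the whole document, the two steps can probably be folded into a single regular expression that tolerates whitespace between the tags. This is only a sketch: the helper name close_unclosed_u_div is made up for illustration, and it assumes the unclosed <div class="u"> is always followed directly by another <div class="u"> after the </a>, as in your sample.

```php
<?php
// Hypothetical helper (not part of simple_html_dom): inserts the missing
// </div> between a closing </a> and the next <div class="u">.
function close_unclosed_u_div(string $html): string
{
    // \s* matches any whitespace or newlines between the two tags,
    // so no separate whitespace-stripping pass is needed.
    $fixed = preg_replace(
        '~</a>\s*<div class="u">~',
        '</a></div><div class="u">',
        $html
    );

    // preg_replace returns null on a regex error; fall back to the input.
    return $fixed ?? $html;
}

$html = '<div class="namepageheader">
            <div class="u">Name: <a href="someurl">Noor Shaad</a>
            <div class="u">Age: </div>
        </div>';

$fixed = close_unclosed_u_div($html);
echo $fixed, PHP_EOL;
```

You can then pass $fixed to str_get_html() and call find() as before. Whitespace elsewhere in the page is left untouched, which may matter if the page contains preformatted text.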

Upvotes: 1
