penjor

Reputation: 21

Can't scrape this site with Goutte Laravel Package. Elements exist in view source

I am building a scraping tool with the Goutte package for Laravel (https://github.com/FriendsOfPHP/Goutte). I have been able to scrape most websites, but I cannot scrape http://www.bhutanpost.bt/, which is the one I need.

I suspect the problem is that the site declares its charset as UTF-7, and the markup Goutte returns is not the same as what "view source" shows. The elements I am trying to scrape do exist in the view source, so they are not being loaded dynamically by JavaScript.

Any help will be highly appreciated.

Upvotes: 1

Views: 363

Answers (1)

penjor

Reputation: 21

I dug through it and found a dirty fix. The problem was with the loadHTML call inside DomCrawler's parseXhtml function. When the charset meta tag is not explicitly defined in the HTML, loadHTML causes problems, so here was my fix:

$dom->loadHTML('<meta http-equiv="Content-Type" content="text/html; charset='.$charset.'">'.$htmlContent);

I prepended a meta tag declaring the charset to the HTML content before passing it to loadHTML.
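
For anyone who wants to try the same idea without editing the library, here is a minimal standalone sketch using only PHP's built-in DOMDocument. The helper name loadHtmlWithCharset, the sample markup, and the 'UTF-8' default are assumptions made for illustration; none of them come from Goutte or DomCrawler.

```php
<?php
// Hypothetical helper (not part of Goutte or DomCrawler): parse raw HTML with
// a charset <meta> tag prepended, so libxml decodes the bytes as intended.
function loadHtmlWithCharset(string $htmlContent, string $charset = 'UTF-8'): DOMDocument
{
    $dom = new DOMDocument();

    // Collect libxml warnings about imperfect markup instead of emitting them.
    $previous = libxml_use_internal_errors(true);

    $meta = '<meta http-equiv="Content-Type" content="text/html; charset=' . $charset . '">';
    $dom->loadHTML($meta . $htmlContent);

    // Restore the previous libxml error handling.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $dom;
}

// Sample input (assumed): UTF-8 bytes with no charset declaration of their own.
$html = '<html><body><p>café</p></body></html>';
$dom = loadHtmlWithCharset($html, 'UTF-8');

$text = $dom->getElementsByTagName('p')->item(0)->textContent;
echo $text, PHP_EOL;
```

In a real scraper the charset would more likely be taken from the response's Content-Type header than hard-coded, and whether this helps with a page that really is served as UTF-7 would need to be checked against that site.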

Upvotes: 1
