Reputation: 21
I am building a scraping tool with the Laravel Goutte package (https://github.com/FriendsOfPHP/Goutte). I have been able to scrape most websites, but I ran into trouble with http://www.bhutanpost.bt/, which I need to scrape.
I suspect the problem is that the site declares its charset as UTF-7, and the XML returned by the crawler is not the same as what is shown in "view source". The elements I am trying to scrape do exist in the view source, so they are not being loaded dynamically by JS.
Any help will be highly appreciated.
Upvotes: 1
Views: 363
Reputation: 21
I dug through it and found a dirty fix. The problem was with the loadHTML call that DomCrawler makes inside its parseXhtml function. When the page has no explicit charset meta tag, loadHTML has trouble decoding the content, so here was my fix:
$dom->loadHTML('<meta http-equiv="Content-Type" content="text/html; charset='.$charset.'">'.$htmlContent);
That is, I prepended a meta tag declaring the charset to the HTML content before passing it to loadHTML.
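For anyone who wants to try the same idea outside of DomCrawler, here is a minimal sketch using plain DOMDocument. The function name loadHtmlWithCharset and the sample markup are made up for illustration and are not part of Goutte or DomCrawler. It also assumes you already know the real charset of the page, for example from the HTTP response headers.

```php
<?php

// Hypothetical helper (not part of Goutte or DomCrawler): load an HTML string
// into a DOMDocument after prepending a charset meta tag, as in the fix above.
function loadHtmlWithCharset(string $htmlContent, string $charset = 'UTF-8'): DOMDocument
{
    $dom = new DOMDocument();

    // Collect libxml warnings about sloppy markup instead of emitting them.
    $previous = libxml_use_internal_errors(true);

    // The prepended meta tag tells the parser which encoding to use.
    $dom->loadHTML(
        '<meta http-equiv="Content-Type" content="text/html; charset=' . $charset . '">'
        . $htmlContent
    );

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $dom;
}

// Sample markup (assumed) that declares no charset of its own.
$dom = loadHtmlWithCharset('<html><body><p>café</p></body></html>', 'UTF-8');
echo $dom->getElementsByTagName('p')->item(0)->textContent, PHP_EOL;
```

This is still a workaround rather than a proper fix: the charset value is inserted into the markup as-is, so it should come from a trusted source such as the response headers.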
Upvotes: 1