Reputation: 76
I'm using some simple PHP to scrape information from a website so it can be read offline. The code seems to be working fine, but I'm worried about undefined behaviour. The site is a bit poorly coded, and some of the elements I'm grabbing share an id with another element. I imagine getElementById traverses the DOM from top to bottom, and the reason I'm not having an issue is that the element I need is the first instance with that id. Is there any way to ensure this behaviour? The element has no other real way of being distinguished, so selecting it by id seems to be the best option. I have included a stripped-back example of the code I'm using below.
Thanks.
<?php
// Target page and the referer header to send with the request
$curl_referer = "http://example.com/";
$curl_url = "http://example.com/content.php";

// Fetch the page with cURL, returning the body as a string
$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, 'Scraper/0.9');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, false);
curl_setopt($ch, CURLOPT_REFERER, $curl_referer);
curl_setopt($ch, CURLOPT_URL, $curl_url);
$output = curl_exec($ch);
curl_close($ch);

// Parse the HTML and pull out the element with id="content"
$dom = new DOMDocument();
@$dom->loadHTML($output);
$content = $dom->getElementById('content');
echo $content->nodeValue;
?>
Upvotes: 0
Views: 85
Reputation: 1496
Try using an XPath expression to select the first element with that id.
Like this: (//*[@id="content"])[1]
The PHP code would look like this:
$xpath = new DOMXPath($dom);
echo $xpath->query('(//*[@id="content"])[1]')->item(0)->nodeValue;
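If you want to be defensive about it, here is a minimal sketch building on the $dom you already load in your question; query() returns a DOMNodeList, so you can check its length before touching the node:
$xpath = new DOMXPath($dom);
$nodes = $xpath->query('(//*[@id="content"])[1]');

// query() gives a DOMNodeList (or false on a bad expression), so check before reading
if ($nodes !== false && $nodes->length > 0) {
    echo $nodes->item(0)->nodeValue;
} else {
    echo 'No element with id="content" found';
}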
And a tip: use libxml_use_internal_errors(true) instead of suppressing errors with @; you can collect the errors later with libxml_get_errors() for logging, or to try tidying up the document.
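A minimal sketch of that approach, assuming $output holds the HTML you fetched with cURL:
libxml_use_internal_errors(true);   // collect parse warnings instead of printing them

$dom = new DOMDocument();
$dom->loadHTML($output);

// Each entry is a LibXMLError with a message and line number you can log
foreach (libxml_get_errors() as $error) {
    error_log(trim($error->message) . ' on line ' . $error->line);
}
libxml_clear_errors();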
Edit
In your code you're setting the user agent to "Scraper/0.9". Most people who write a bad website don't look at that and don't log incoming requests, but I don't recommend a UA like that anyway; use a browser user agent, such as Chrome's, because if they are monitoring requests and see that user agent, they may blacklist you in the future.
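For example (the exact string below is only an illustration; grab a current one from your own browser):
// Send a typical desktop Chrome user agent instead of an obvious scraper UA
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36');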
Upvotes: 3