Zacarias

Reputation: 11

Is it possible to scrape a JavaScript-based website using Goutte/PHP?

I want to scrape several websites, which are apparently rendered using JavaScript. To be specific, I want to target this website: http://cve.mitre.org/find/index.html

This is my code:

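require 'vendor/autoload.php'; // assumes Goutte was installed with Composer

use Goutte\Client;

$client = new Client();

// Load the search page and locate the form behind the "Search" button
$crawler = $client->request('GET', 'http://cve.mitre.org/find/index.html');
$form = $crawler->selectButton('Search')->form();
// Submit the form with the search term
$crawler = $client->submit($form, array('search' => 'Symphony'));

print $crawler->html();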

If I view the source code, I don't see the HTML, because this request is made by JavaScript. Does anyone know how to scrape this kind of website?

Upvotes: 1

Views: 3003

Answers (1)

halfer

Reputation: 20439

This site has bolted on a lazy "Google custom search" rather than implement their own, which means that the site comes with all sorts of JavaScript cruft.

It looks like the actual search might be done by a traditional form submission, so you may just need to post a form using the elements that Google renders. However, it may not be that easy, since Google may check referrers and so forth, and block the request anyway.

You have a few options, I think:

  • Use a headless browser like PhantomJS to run the search. You can try driving this directly, or use something like Spiderling. This will definitely work, but it's a bit slower than running a simple browser like Goutte, and it requires admin rights to get running on a server.
  • Scrape Google directly, adding site:cve.mitre.org to the query as appropriate
  • Sign up to a Google search API and use that directly
  • Try injecting the required form into Goutte and submitting the form to Google (difficult to know if it will work until you try)
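
To illustrate the search API option, here is a minimal sketch, assuming you use Google's Custom Search JSON API and have already obtained an API key and a search engine ID (the `cx` value). The function names `buildSearchUrl` and `extractResults` are hypothetical helpers of mine, not part of Goutte or any Google library, and I have not tried this against the site in question.

```php
<?php
// Hypothetical helper: builds a request URL for the Custom Search JSON API.
// $apiKey and $engineId are placeholders for credentials obtained from Google.
function buildSearchUrl(string $apiKey, string $engineId, string $query): string
{
    return 'https://www.googleapis.com/customsearch/v1?' . http_build_query([
        'key' => $apiKey,
        'cx'  => $engineId,
        'q'   => $query,
    ]);
}

// Hypothetical helper: pulls title/link pairs out of a decoded JSON response.
// Returns an empty array when the response contains no "items" key.
function extractResults(array $response): array
{
    $results = [];
    foreach ($response['items'] ?? [] as $item) {
        $results[] = [
            'title' => $item['title'] ?? '',
            'link'  => $item['link'] ?? '',
        ];
    }
    return $results;
}
```

You would then fetch the URL with something like `file_get_contents(buildSearchUrl($key, $cx, 'Symphony'))`, pass the body through `json_decode($json, true)`, and hand the resulting array to `extractResults()`. The response is plain JSON, so no JavaScript rendering is involved.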

Upvotes: 3
