nickm

Reputation: 133

How Can I Scrape Data From Websites That Don't Return Simple HTML

I have been using requests and BeautifulSoup in Python to scrape HTML from basic websites, but most modern websites don't just deliver HTML as a result. I believe they run JavaScript or something (I'm not very familiar, sort of a noob here). I was wondering if anyone knows how to, say, search for a flight on Google Flights and scrape the top result, i.e. the cheapest price?

If this were simple HTML, I could just parse the HTML tree and find the text result, but the price does not appear when you view the "page source". If you inspect the element in your browser, you can see the price inside HTML tags, as if you were looking at the regular page source of a basic website.

What is going on here that the inspect element has the html but the page source doesn't? And does anyone know how to scrape this kind of data?

Thanks so much!

Inspect Element Javascript?

Upvotes: 5

Views: 2089

Answers (2)

Nathaniel Ford

Reputation: 21249

You might consider using Scrapy, which will allow you to scrape a page, along with a lot of other spider functionality. Scrapy integrates well with Splash, which is a rendering service you can use to execute the JavaScript in a page. Splash can be used stand-alone, or through the scrapy-splash plugin.

Note that Splash essentially runs its own server to do the JavaScript execution, so it's something that would run alongside your main script and be called from it. Scrapy manages this via 'middleware', or a set of processes that run on every request: in your case you would fetch the page, run the JavaScript in Splash, and then parse the results.

This may be a slightly lighter-weight option than plugging into Selenium or the like, especially if all you're trying to do is render the page rather than render it and then interact with various parts in an automated fashion.
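As a rough sketch of the "Splash runs as its own server and gets called" idea, here is how a plain script might ask a stand-alone Splash instance for the rendered page over HTTP. It assumes Splash is already running locally on port 8050 (its usual default) and uses the render.html endpoint with its url and wait parameters; the function names and the two-second wait are just placeholders for illustration, not part of Scrapy or Splash.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: a Splash instance is listening locally on its usual default port.
SPLASH_BASE = "http://localhost:8050"


def build_render_url(page_url, wait=2.0, splash_base=SPLASH_BASE):
    """Build a render.html request URL (hypothetical helper name).

    'url' is the page to load; 'wait' is how many seconds Splash lets the
    page's JavaScript run before returning the resulting HTML.
    """
    query = urlencode({"url": page_url, "wait": wait})
    return f"{splash_base}/render.html?{query}"


def fetch_rendered_html(page_url, wait=2.0, timeout=30):
    """Return the HTML of page_url after Splash has executed its JavaScript."""
    with urlopen(build_render_url(page_url, wait), timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

The string returned by fetch_rendered_html can then be handed to BeautifulSoup in the usual way. With scrapy-splash, the middleware makes an equivalent call for you on each request.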

Upvotes: 2

Clara B

Reputation: 461

You're spot on -- the page markup is getting added with javascript after the initial server response. I haven't used BeautifulSoup, but from its documentation, it looks like it doesn't execute javascript, so you're out of luck on that front.
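To make the difference between "page source" and "inspect element" concrete, here is a small sketch with a made-up page. It uses Python's built-in html.parser as a stand-in for any static parser (BeautifulSoup behaves the same way here, since neither runs scripts). The markup, the $199 price, and the TagCollector class are all invented for illustration.

```python
from html.parser import HTMLParser

# Hypothetical server response: an empty container plus a script that
# would insert the price only once a browser actually runs it.
RAW_HTML = """
<html><body>
  <div id="results"></div>
  <script>
    document.getElementById("results").innerHTML =
      "<span class='price'>$199</span>";
  </script>
</body></html>
"""


class TagCollector(HTMLParser):
    """Records (tag, attributes) for every element in the markup as delivered."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append((tag, dict(attrs)))


collector = TagCollector()
collector.feed(RAW_HTML)

# The span exists only as text inside the script, so a static parse never
# sees it as an element; the results div is present but empty.
price_spans = [tag for tag in collector.tags if tag[0] == "span"]
```

Here price_spans comes back empty, which is what "view source" shows. The browser's inspector shows the span because it displays the page after the script has run.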

You might try Selenium, which automates a real browser -- people use it for front-end testing. It executes JavaScript, so it should be able to give you the rendered page you want.
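A minimal sketch of the Selenium route, assuming Selenium 4 and a local Chrome install. The URL and CSS selector you pass in are up to you (I don't know what selector Google Flights uses, so none is given here), and extract_price is just a hypothetical helper for pulling a dollar amount out of the rendered text.

```python
import re


def extract_price(text):
    """Return the first dollar amount in text as a float, or None if absent."""
    match = re.search(r"\$\s?([\d,]+(?:\.\d{2})?)", text)
    return float(match.group(1).replace(",", "")) if match else None


def fetch_rendered_text(url, css_selector, timeout=15):
    """Load url in headless Chrome and return the text of the first match.

    css_selector is a placeholder: inspect the page to find the right one.
    """
    # Imported here so extract_price stays usable without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Wait until JavaScript has added the element to the page.
        element = WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, css_selector))
        )
        return element.text
    finally:
        driver.quit()
```

The explicit wait matters: the element you are after may not exist yet when driver.get returns, because the script that creates it has not finished running.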

But if you're specifically looking for Google Flights information, there's an API for that :) https://developers.google.com/qpx-express/v1/

Upvotes: 3
