Jaber

Reputation: 423

Can't scrape all HTML from Airbnb

I'm learning to scrape and am trying it out on Airbnb (here's the page). When I inspect one of the home images using Google Chrome, I see this: [screenshot of the inspected element]

I can't get my script to return the HTML that represents the stuff pictured (e.g. the link to the listing). Initial attempt:

import requests

url = "https://www.airbnb.co.uk/s/Rome/homes?checkin=2017-11-12&checkout=2017-11-19"
landing = requests.get(url)

print landing.content.find("rooms/")

That just prints -1 (i.e. rooms/ isn't in the HTML).

Some research then turned up the idea of sending browser-like 'headers', so that Airbnb doesn't know I'm a script (the code is copy/pasted, as I don't really understand what these headers do). Someone else suggested using urllib2 instead of requests. So the latest attempt is:

from urllib2 import Request,urlopen

user_agent = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.101 Safari/537.36'
headers = { 'User-Agent' : user_agent }

url = "https://www.airbnb.co.uk/s/Rome/homes?checkin=2017-11-12&checkout=2017-11-19"

req = Request(url,None,headers)
landing = urlopen(req)
print landing.read().find('rooms/')

This also prints -1.

Any idea is much appreciated. I'm using Python 2.7 (Windows).

Upvotes: 0

Views: 1027

Answers (2)

Bart Van Loon

Reputation: 1510

This happens because the listings are only loaded into your browser window by JavaScript after the initial request has finished, so the HTML returned by that first request does not contain them yet. That is simply how Airbnb populates the DOM of its pages.

In order to be able to scrape such pages, you will need more advanced tricks than simple requests, I'm afraid.
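To make the point concrete, here is a minimal Python 3 sketch using only the standard library. Both HTML strings are made up for illustration and are not Airbnb's real markup, and the `/api/listings` endpoint and `render` function in the script tag are hypothetical. The first string stands in for what a plain HTTP request receives, the second for what the browser's inspector shows after the script has run.

```python
from html.parser import HTMLParser

# Hypothetical response body: an empty container plus a script that would
# fetch the listings later. No listing link exists in the markup yet.
RAW_HTML = """
<html><body>
  <div id="listings"></div>
  <script>fetch("/api/listings").then(r => r.json()).then(render);</script>
</body></html>
"""

# Hypothetical DOM after the browser has executed the script.
RENDERED_HTML = """
<html><body>
  <div id="listings"><a href="/rooms/123">Flat in Rome</a></div>
</body></html>
"""


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag found in the markup."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def listing_links(html):
    """Return all link targets that contain 'rooms/'."""
    parser = LinkCollector()
    parser.feed(html)
    return [href for href in parser.hrefs if "rooms/" in href]


print(RAW_HTML.find("rooms/"))       # -1, the same symptom as in the question
print(listing_links(RAW_HTML))       # []
print(listing_links(RENDERED_HTML))  # ['/rooms/123']
```

Changing the User-Agent or swapping requests for urllib2 does not help here, because either way nothing executes the script that adds the links.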

Two tips, if you're a beginner:

  • start by testing on simple websites, ideally static ones, if you can find any interesting ones
  • don't go for Python 2. Python 3 has been out for a long time now, so it's best to start with that right away.

Good luck!

Upvotes: 2

amarynets

Reputation: 1815

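It happens because requests doesn't run JavaScript code, so the listing links are never added to the HTML you receive. As a result you can't find rooms/. You could use Selenium or Splash, which render the page before you read it.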

If you open the page source (View Source rather than the element inspector) and search for rooms/, you will find no results either.
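A rough sketch of the Selenium route in Python 3 might look like the following. It assumes Selenium 4 and a local Chrome install. The function names are made up for this example, and the CSS selector assumes the listing links contain rooms/ in their href, as the inspector in the question suggests. Airbnb's markup may differ or change, so treat this as a starting point rather than a tested scraper.

```python
def has_listing_links(html, marker="rooms/"):
    """Return True if the marker from the question appears in the HTML."""
    return marker in html


def fetch_rendered_html(url, timeout=15):
    """Load url in headless Chrome and return the DOM after scripts have run."""
    # Imported here so the helper above stays usable without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Assumption: a listing link has "rooms/" somewhere in its href.
        # Wait until at least one such link exists, or raise TimeoutException.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, 'a[href*="rooms/"]'))
        )
        return driver.page_source
    finally:
        # Always close the browser, even if the wait times out.
        driver.quit()
```

Calling `has_listing_links(fetch_rendered_html(url))` with the URL from the question would then check the rendered DOM instead of the raw response. The explicit wait matters because `driver.get` can return before the listings have been injected.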

Upvotes: 3
