jmramosfran

Reputation: 11

Python data scraping - Elementary concepts

I am trying to get my head around how data scraping works when you look past HTML (i.e. DOM scraping).

I've been trying to write a simple Python script to automatically retrieve the number of people who have viewed a specific ad: the part where it says '3365 people viewed Peter's place this week.'

At first I checked whether that number was in the HTML source but could not find it. I did some research and saw that not everything will be in the source, as it can be processed by the browser through JavaScript or other languages that I don't quite understand yet. I then inspected the element and realised that I would need to use the Python libraries 'requests' and 'lxml.html'. So I wrote this code:

import requests
import lxml.html

response = requests.get('https://www.airbnb.co.uk/rooms/501171')
resptext = lxml.html.fromstring(response.text)
final = resptext.text_content()
finalu = final.encode('utf-8')

file = open('file.txt', 'w')

file.write(finalu) 

file.close()

With that, I get a file with all the text on the web page, but not the text that I am looking for: the magic number 3365.

So my question is: how do I get it? I suspect I am not using the right tool to get the DOM. Maybe the value is filled in by JavaScript and lxml alone cannot see it. However, I have no idea.

Upvotes: 1

Views: 178

Answers (2)

user862319

The DOM element you are looking at is updated after page load by what looks like an AJAX call to the following request URL:

https://www.airbnb.co.uk/rooms/501171/personalization.json

If you GET that URL, it will return the following JSON data:

{
   "extras_price":"£30",
   "preview_bar_phrases":{
      "steps_remaining":"<strong>1 step</strong> to list"
   },
   "flag_info":{

   },
   "user_is_admin":false,
   "is_owned_by_user":false,
   "is_instant_bookable":true,
   "instant_book_reasons":{
      "within_max_lead_time":null,
      "within_max_nights":null,
      "enough_lead_time":true,
      "valid_reservation_status":null,
      "not_country_or_village":true,
      "allowed_noone":null,
      "allowed_everyone":true,
      "allowed_socially_connected":null,
      "allowed_experienced_guest":null,
      "is_instant_book_host":true,
      "guest_has_profile_pic":null
   },
   "instant_book_experiments":{
      "ib_max_nights":14
   },
   "lat":51.5299601405844,
   "lng":-0.12462748035984603,
   "localized_people_pricing_description":"&pound;30 / night after 2 guests",
   "monthly_price":"&pound;4200",
   "nightly_price":"&pound;150",
   "security_deposit":"",
   "social_connections":{
      "connected":null
   },
   "staggered_price":"&pound;4452",
   "weekly_price":"&pound;1050",
   "show_disaster_info":false,
   "cancellation_policy":"Strict",
   "cancellation_policy_link":"/home/cancellation_policies#strict",
   "show_fb_cta":true,
   "should_show_review_translations":false,
   "listing_activity_data":{
      "day":{
         "unique_views":226,
         "total_views":363
      },
      "week":{
         "unique_views":3365,
         "total_views":5000
      }
   },
   "should_hide_action_buttons":false
}

If you look under "listing_activity_data" you will find the information you seek. Appending /personalization.json to any room URL seems to return this data (for now).
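As a minimal sketch of reading that value once you have the response body, the snippet below parses an abridged copy of the JSON above and walks down to the view counts. The helper name `unique_views` and the `period` parameter are made up for illustration, and the sample keeps only the keys that matter here.

```python
import json

# Abridged copy of the response shown above; only the relevant keys are kept.
SAMPLE_RESPONSE = """
{
   "nightly_price": "&pound;150",
   "listing_activity_data": {
      "day":  {"unique_views": 226,  "total_views": 363},
      "week": {"unique_views": 3365, "total_views": 5000}
   }
}
"""


def unique_views(raw_json, period="week"):
    """Return unique_views for 'day' or 'week', or None if the key is absent."""
    data = json.loads(raw_json)
    activity = data.get("listing_activity_data", {})
    return activity.get(period, {}).get("unique_views")


print(unique_views(SAMPLE_RESPONSE))         # weekly unique views
print(unique_views(SAMPLE_RESPONSE, "day"))  # daily unique views
```

Using `.get()` at each level means a response without "listing_activity_data" yields None instead of raising a KeyError, which seems prudent given that this endpoint is undocumented.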

Update per the user agent issues

It looks like they are filtering requests to this URL based on user agent. I had to set the user agent on the urllib request in order to fix this:

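import json
import urllib2  # Python 2; in Python 3 use urllib.request instead

# Send a browser-like User-Agent, since requests appear to be filtered on it.
headers = {'User-Agent': 'Mozilla/5.0'}
req = urllib2.Request('https://www.airbnb.co.uk/rooms/501171/personalization.json', None, headers)
data = json.load(urllib2.urlopen(req))  # named 'data' to avoid shadowing the json module

print(data['listing_activity_data']['week']['unique_views'])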
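Since the question already uses requests, here is a rough equivalent with that library. Treat it as a sketch rather than a definitive solution: the function names, the `http_get` parameter (there only so the function can be tried without network access) and the 10-second timeout are my own choices, and it assumes the endpoint still behaves as described above.

```python
def personalization_url(room_url):
    """Append the /personalization.json suffix to a room URL."""
    return room_url.rstrip("/") + "/personalization.json"


def fetch_weekly_unique_views(room_url, http_get=None):
    """Fetch the JSON for a room and return week -> unique_views.

    http_get is a hypothetical injection point with the same call shape
    as requests.get; it defaults to requests.get when omitted.
    """
    if http_get is None:
        import requests  # imported lazily; only needed for real HTTP calls
        http_get = requests.get

    # Browser-like User-Agent, as in the urllib2 version above.
    headers = {"User-Agent": "Mozilla/5.0"}
    response = http_get(personalization_url(room_url), headers=headers, timeout=10)
    response.raise_for_status()  # fail loudly on 4xx/5xx instead of parsing an error page
    data = response.json()
    return data["listing_activity_data"]["week"]["unique_views"]
```

Calling `fetch_weekly_unique_views('https://www.airbnb.co.uk/rooms/501171')` would then perform the same GET as the urllib2 code.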

Upvotes: 2

d123

Reputation: 1617

First of all, you need to figure out whether that section of the page has any unique tags or ids. If you look at the HTML tree you have

html > body > #room > ....... > #book-it-urgency-commitment > div > div > ... > div#media-body > b

The data you need is stored in a 'b' tag. I'm not sure about using lxml, but I usually use BeautifulSoup for my scraping.

You can refer to http://www.crummy.com/software/BeautifulSoup/bs4/doc/ for details; it's pretty straightforward.
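A minimal BeautifulSoup sketch of that lookup might look like the following. The HTML snippet is a hypothetical stand-in built from the ids in the path above, not the real page markup, and `extract_view_count` is a name I made up. Note that this only works if the 'b' tag is actually present in the HTML you feed it. If the number is filled in by JavaScript after page load, as the other answer suggests, the downloaded source may not contain it.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup modelled on the element path above.
SAMPLE_HTML = """
<div id="book-it-urgency-commitment">
  <div>
    <div id="media-body">
      <b>3365 people</b> viewed Peter's place this week.
    </div>
  </div>
</div>
"""


def extract_view_count(html):
    """Return the first integer inside the <b> tag, or None if it is not found."""
    soup = BeautifulSoup(html, "html.parser")
    # CSS selector: any <b> descendant of the element with this id.
    tag = soup.select_one("#book-it-urgency-commitment b")
    if tag is None:
        return None
    match = re.search(r"\d+", tag.get_text())
    return int(match.group()) if match else None


print(extract_view_count(SAMPLE_HTML))
```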

Upvotes: 0
