Reputation: 165
I want to scrape the contents of a website, using the library called BeautifulSoup.
Code:
from bs4 import BeautifulSoup
from urllib.request import urlopen
html_http_response = urlopen("http://www.airlinequality.com/airport-reviews/jeddah-airport/")
data = html_http_response.read()
soup = BeautifulSoup(data, "html.parser")
print(soup.prettify())
Output:
<html style="height:100%">
<head>
<meta content="NOINDEX, NOFOLLOW" name="ROBOTS"/>
<meta content="telephone=no" name="format-detection"/>
<meta content="initial-scale=1.0" name="viewport"/>
<meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible"/>
</head>
<body style="margin:0px;height:100%">
<iframe frameborder="0" height="100%" marginheight="0px" marginwidth="0px" src="/_Incapsula_Resource?CWUDNSAI=9&xinfo=9-57435048-0%200NNN%20RT%281512733380259%202%29%20q%280%20-1%20-1%20-1%29%20r%280%20-1%29%20B12%284%2c315%2c0%29%20U19&incident_id=466002040110357581-305794245507288265&edet=12&cinfo=04000000" width="100%">
Request unsuccessful. Incapsula incident ID: 466002040110357581-305794245507288265
</iframe>
</body>
</html>
The body contains an iframe tag instead of the content I see when inspecting the page in the browser.
Upvotes: 2
Views: 2585
Reputation: 2445
This website uses cookies to validate requests. When you visit the website for the first time, you need to check the "I'm not a robot" option. After that, the browser sends the incap_ses_415_965359, PHPSESSID, visid_incap_965359, _ga and _gid cookie values in the headers of its requests.
So I copied the cookies from the Chrome dev tools and saved them in a dictionary.
from bs4 import BeautifulSoup
import requests
cookies = {
    'incap_ses_415_965359': 'djRha9OqhshstDcXvPV8cmHCBQGBKloAAAAAN3/D9dvoqwEc7GPEwefkhQ==',
    'PHPSESSID': 'fjmr7plc0dmocm8roq7togcp92',
    'visid_incap_965359': 'akteT8lDT1iyST7XJO7wdQGBKloAAAns;aAAQkIPAAAAAACAWbWAAQ6Ozzrln35KG6DhLXMRYnMjxOmY',
    '_ga': 'GA1.2.894579844.151uus2734989',
    '_gid': 'GA1.2.1055878562.1598994989',
}
html_http_response = requests.get("http://www.airlinequality.com/airport-reviews/jeddah-airport", cookies=cookies)
data = html_http_response.text
soup = BeautifulSoup(data, "html.parser")
print(soup.prettify())
Get the cookie values from your own browser and update the dictionary with them; the values above come from my session.
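To notice when the cookies stop working, you could check whether the response is still the Incapsula block page before parsing it. This is only a minimal sketch: `is_incapsula_block` and `BLOCK_MARKERS` are names I made up, and the two marker strings are copied from the output in the question, so it is an assumption that the block page always contains one of them.

```python
# Marker strings copied from the blocked response shown in the question.
# Assumption: the Incapsula block page always contains at least one of them.
BLOCK_MARKERS = ("_Incapsula_Resource", "Incapsula incident ID")


def is_incapsula_block(html: str) -> bool:
    """Return True if the HTML looks like the Incapsula block page."""
    return any(marker in html for marker in BLOCK_MARKERS)
```

With the request code above, you could call `is_incapsula_block(html_http_response.text)` and, if it returns True, copy fresh cookie values from the browser before trying again.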
Upvotes: 7
Reputation: 915
The data you are looking for doesn't exist yet, because this page has JavaScript-generated data. Have a look at the Selenium library and you will find it (it's rather easy). The data you want is only created when you actually load the page in a browser and, for example, click a search button. (Keep in mind that with iframes you must select them first.)
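A rough sketch of what that could look like, assuming Selenium 4 and a local Chrome installation. `fetch_rendered_html`, `frame_css` and `main` are hypothetical names for illustration, and whether the site lets an automated browser through is not something this sketch guarantees.

```python
URL = "http://www.airlinequality.com/airport-reviews/jeddah-airport/"


def fetch_rendered_html(driver, url, frame_css=None):
    """Load url in the given WebDriver and return the rendered HTML.

    If frame_css is given, switch into the first matching iframe first,
    so that page_source reflects the iframe's document.
    """
    driver.get(url)
    if frame_css is not None:
        # "css selector" is the string value of selenium's By.CSS_SELECTOR.
        frame = driver.find_element("css selector", frame_css)
        driver.switch_to.frame(frame)
    return driver.page_source


def main():
    # Imported here so the helper above can be used without Selenium
    # installed. Assumes Selenium 4 and a local Chrome.
    from selenium import webdriver

    driver = webdriver.Chrome()
    try:
        html = fetch_rendered_html(driver, URL)
        print(html[:500])  # first part of the rendered page
    finally:
        driver.quit()
```

Calling `main()` opens a real browser window. To read the contents of an iframe instead of the outer page, you would pass something like `frame_css="iframe"` to `fetch_rendered_html`.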
Upvotes: 0