SafoineMoncefAmine

Reputation: 165

Why can't BeautifulSoup show me the content of the website?

I want to scrape the contents of a website using the BeautifulSoup library.

Code:

from bs4 import BeautifulSoup
from urllib.request import urlopen
html_http_response = urlopen("http://www.airlinequality.com/airport-reviews/jeddah-airport/")
data = html_http_response.read()
soup = BeautifulSoup(data, "html.parser")
print(soup.prettify())

Output:

<html style="height:100%">
 <head>
  <meta content="NOINDEX, NOFOLLOW" name="ROBOTS"/>
  <meta content="telephone=no" name="format-detection"/>
  <meta content="initial-scale=1.0" name="viewport"/>
  <meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible"/>
 </head>
 <body style="margin:0px;height:100%">
  <iframe frameborder="0" height="100%" marginheight="0px" marginwidth="0px" src="/_Incapsula_Resource?CWUDNSAI=9&amp;xinfo=9-57435048-0%200NNN%20RT%281512733380259%202%29%20q%280%20-1%20-1%20-1%29%20r%280%20-1%29%20B12%284%2c315%2c0%29%20U19&amp;incident_id=466002040110357581-305794245507288265&amp;edet=12&amp;cinfo=04000000" width="100%">
   Request unsuccessful. Incapsula incident ID: 466002040110357581-305794245507288265
  </iframe>
 </body>
</html>

The body contains an iframe tag instead of the content I see when inspecting the page in the browser.

Upvotes: 2

Views: 2585

Answers (2)

skipper21

Reputation: 2445

This website uses cookies to validate requests. If you visit the website for the first time, you need to check the "I'm not a robot" option. After that, the browser sends the incap_ses_415_965359, PHPSESSID, visid_incap_965359, _ga and _gid cookie values in the headers of its requests.

So I copied the cookies from the Chrome dev tools and saved them in a dictionary.

from bs4 import BeautifulSoup
import requests

# Cookie values copied from the browser's dev tools; replace them with values from your own browser.
cookies = {
    'incap_ses_415_965359': 'djRha9OqhshstDcXvPV8cmHCBQGBKloAAAAAN3/D9dvoqwEc7GPEwefkhQ==',
    'PHPSESSID': 'fjmr7plc0dmocm8roq7togcp92',
    'visid_incap_965359': 'akteT8lDT1iyST7XJO7wdQGBKloAAAns;aAAQkIPAAAAAACAWbWAAQ6Ozzrln35KG6DhLXMRYnMjxOmY',
    '_ga': 'GA1.2.894579844.151uus2734989',
    '_gid': 'GA1.2.1055878562.1598994989',
}
# Send the cookies along with the request so it passes the site's validation.
html_http_response = requests.get("http://www.airlinequality.com/airport-reviews/jeddah-airport", cookies=cookies)
data = html_http_response.text
soup = BeautifulSoup(data, "html.parser")
print(soup.prettify())

Get the cookie values from your own browser and update the dictionary with them.
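
As a rough sketch of the copying step, the cookies can also be taken from the dev tools as a single Cookie request-header string and converted into a dictionary. The helper names parse_cookie_header and looks_blocked are hypothetical, and the block-page markers are an assumption based on the output shown in the question:

```python
def parse_cookie_header(header):
    """Turn a 'name=value; name2=value2' Cookie header string into a dict."""
    cookies = {}
    for pair in header.split(";"):
        # Split on the first '=' only, because cookie values may contain '='.
        name, sep, value = pair.strip().partition("=")
        if sep and name:
            cookies[name] = value
    return cookies


def looks_blocked(html):
    """Guess whether a response is the Incapsula block page rather than real content.

    Assumption: both markers are taken from the output in the question.
    """
    return "_Incapsula_Resource" in html or "Incapsula incident ID" in html
```

The resulting dictionary can be passed as cookies= to requests.get. If looks_blocked(response.text) returns True, the cookies have probably stopped being accepted and need to be copied from the browser again.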

Upvotes: 7

The data you are looking for doesn't exist yet, because this page has JavaScript-generated data. You should look into the Selenium library, which can handle this (it's rather easy). The data you want is only created when you actually load the page in a browser and, for example, click a search button. Keep in mind that to read content inside an iframe, you must select the iframe first.
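
A minimal sketch of that approach, assuming Selenium and a matching browser driver are installed. The helper name fetch_rendered_html and its frame_index parameter are hypothetical. It takes the driver as an argument and relies only on the WebDriver calls get, switch_to.frame, switch_to.default_content and page_source:

```python
def fetch_rendered_html(driver, url, frame_index=None):
    """Load url in a real browser and return the HTML after the page has loaded.

    driver is a Selenium WebDriver instance created by the caller.
    frame_index (hypothetical parameter) selects an iframe by position;
    None means the top-level document is returned.
    """
    driver.get(url)
    if frame_index is None:
        return driver.page_source
    # Select the iframe first; page_source then refers to the frame's document.
    driver.switch_to.frame(frame_index)
    try:
        return driver.page_source
    finally:
        # Return to the top-level document so later calls start from a known state.
        driver.switch_to.default_content()
```

With Selenium installed, this could be called as fetch_rendered_html(webdriver.Chrome(), url, frame_index=0), and the returned string passed to BeautifulSoup as in the question. Content that appears only after a delay or a click would likely need an explicit wait or interaction before reading page_source.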

Upvotes: 0
