hx chua

Reputation: 1

Scraping content hidden with display: none and visibility: hidden in Python

I'm trying to scrape data from a website using BeautifulSoup in Python. When I parse the page, the information I want to scrape doesn't show up; instead I see this:

<span class="frwp-debug hidden" style="display: none!important; visibility: hidden!important;">  

The parsed html is different from what I see when I inspect the page.

This is my code:

import requests
from bs4 import BeautifulSoup

site = "http://www.fifa.com/worldcup/stories/y=2017/m=11/news=australia-2921204.html#World_Cup_History"
hdr = {'User-Agent': 'Mozilla/5.0'}
# Send the User-Agent header along with the request
page = requests.get(site, headers=hdr)
soup = BeautifulSoup(page.text, "html.parser")
print(soup.prettify())

How do I scrape the hidden information?

Upvotes: 0

Views: 1988

Answers (1)

jchung

Reputation: 953

The problem is that the content you want is created by JavaScript after the page loads. The requests library only downloads the HTML the server sends and never runs the page's scripts, so BeautifulSoup never sees that content. Fortunately, you can use the Selenium library together with PhantomJS to get the fully rendered page, and then use BeautifulSoup to parse the resulting (finished) HTML.
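To see the effect in isolation, here is a minimal sketch using a made-up HTML snippet (the "history" id and the script are hypothetical, not taken from the FIFA page). The script would fill the div in a browser, but BeautifulSoup only parses the markup and does not execute it:

```python
from bs4 import BeautifulSoup

# Hypothetical page: the div is empty in the served HTML and would only
# be filled in by the script once a browser runs it.
html = """
<div id="history"></div>
<script>
  document.getElementById("history").textContent = "World Cup history";
</script>
"""

soup = BeautifulSoup(html, "html.parser")

# BeautifulSoup treats the script as plain text and never runs it,
# so the div is still empty after parsing.
history_text = soup.find(id="history").get_text()
print(repr(history_text))
```

The div comes back empty, which is the same situation as parsing page.text from requests.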

Here's how the Selenium approach would work in your case:

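from bs4 import BeautifulSoup
from selenium import webdriver

site = "http://www.fifa.com/worldcup/stories/y=2017/m=11/news=australia-2921204.html#World_Cup_History"
# PhantomJS loads the page and runs its JavaScript
browser = webdriver.PhantomJS()
browser.get(site)
# page_source holds the HTML after the scripts have run
html = browser.page_source
browser.quit()  # shut down the PhantomJS process
soup = BeautifulSoup(html, 'html.parser')
print(soup.prettify())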

That should solve your problem.

Note that you'll have to install a couple of things: Selenium (pip install selenium) and the PhantomJS executable (downloadable from http://phantomjs.org/download.html). Depending on how you install it, you may have to add it to your system path. I used this SO answer for that.
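Selenium 4.x no longer includes webdriver.PhantomJS, so on a current install the same idea can be tried with headless Chrome instead. The following is only a sketch under assumptions: Chrome and a matching driver are available on the machine, and the helper names fetch_rendered_html and make_headless_chrome are made up for illustration.

```python
def fetch_rendered_html(url, make_driver):
    """Load url in a browser and return the HTML after scripts have run.

    make_driver is any zero-argument callable returning a Selenium-style
    driver (an object with get(), page_source and quit()).
    """
    driver = make_driver()
    try:
        driver.get(url)
        return driver.page_source
    finally:
        # Always close the browser, even if loading the page fails
        driver.quit()


def make_headless_chrome():
    """Hypothetical factory: assumes Chrome and its driver are installed."""
    # Imported here so the helper above can be used without Selenium
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")  # run Chrome without a window
    return webdriver.Chrome(options=options)
```

Calling fetch_rendered_html(site, make_headless_chrome) and passing the result to BeautifulSoup(html, 'html.parser') would then mirror the PhantomJS version. If the content is loaded late, page_source may still be read too early, in which case an explicit wait would be needed.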

Upvotes: 1
