Univold

Reputation: 33

Web Scraping specific page with Python

Recently I've been learning web scraping with Python and Beautiful Soup. However, I've hit a bit of a bump when trying to scrape the following page:

http://www.librarything.com/work/3203347

The data I want from the page is the tags for the book but I can't find any way to get the data despite spending a lot of time trawling the internet.

I tried following a few guides online but none of them seemed to work. I tried converting the page to XML and JSON but I still couldn't find the data.

Pretty stumped at the moment and I'd appreciate any help.

Thanks.

Upvotes: 1

Views: 1120

Answers (3)

BoboDarph

Reputation: 2891

A possible implementation using Selenium instead of Beautiful Soup:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

my_url = 'http://www.librarything.com/work/3203347'
driver = webdriver.Chrome()
driver.get(my_url)

delay = 5 # seconds

try:
    WebDriverWait(driver, delay).until(EC.presence_of_element_located((By.CSS_SELECTOR, 'span.tag')))
    print("Page is ready!")
    # find_elements_by_css_selector was removed in Selenium 4; use find_elements with By
    for element in driver.find_elements(By.CSS_SELECTOR, 'span.tag'):
        print(element.text)
except TimeoutException:
    print("Couldn't load page")
finally:
    driver.quit()

Sources for the implementation:

Waiting until an element identified by its css is present

Locating elements with selenium

Upvotes: 0

Vivek Harikrishnan

Reputation: 866

After analyzing the page's HTML and scripts, I found that the tags are loaded through AJAX, so requesting the AJAX URL directly makes life easy. Here is the Python script.

import requests
from bs4 import BeautifulSoup

content = requests.get("http://www.librarything.com/ajax_work_makeworkCloud.php?work=3203347&check=2801929225").text
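# Name the parser explicitly; otherwise bs4 warns and picks whichever parser is installed
soup = BeautifulSoup(content, "html.parser")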

for tag in soup.find_all('a'):
    print(tag)
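
If only the tag names are wanted rather than the full `<a>` elements, the link text can be pulled out on its own. Below is a minimal standard-library sketch of that step, so it can be tried without any network access. The `SAMPLE_FRAGMENT` string, the `LinkTextCollector` class and the `extract_link_texts` helper are all made up for illustration; the real AJAX response may be structured differently.

```python
from html.parser import HTMLParser


class LinkTextCollector(HTMLParser):
    """Collect the visible text of every <a> element in an HTML fragment."""

    def __init__(self):
        super().__init__()
        self._in_link = False   # True while inside an <a>...</a> pair
        self._buffer = []       # text pieces of the link currently being read
        self.texts = []         # finished link texts, in document order

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_link = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_link:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            text = "".join(self._buffer).strip()
            if text:
                self.texts.append(text)
            self._in_link = False


def extract_link_texts(fragment):
    """Return the text of each link in the fragment as a list of strings."""
    parser = LinkTextCollector()
    parser.feed(fragment)
    parser.close()
    return parser.texts


# Hypothetical fragment standing in for the AJAX response; not the site's real markup.
SAMPLE_FRAGMENT = (
    '<span class="tag"><a href="/tag/fantasy">fantasy</a></span> '
    '<span class="tag"><a href="/tag/young+adult">young adult</a></span>'
)

print(extract_link_texts(SAMPLE_FRAGMENT))  # ['fantasy', 'young adult']
```

With Beautiful Soup, the same idea would be calling `tag.get_text()` inside the loop instead of printing `tag` itself.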

Upvotes: 2

Goutham Santhakumar

Reputation: 1

I'm not sure which data you want to scrape from the page. When I checked, though, the page loads its "Tags" dynamically through JavaScript that runs once the page has loaded. If your scraper only fetches the raw HTML and parses it in the background, without loading the page in a browser, it's highly likely that none of the dynamic data will be present.

One possible solution is to use Selenium to load the page completely and then scrape it.
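
A quick way to check whether this is what is happening is to count the elements you are after in the HTML your scraper actually receives. The sketch below uses only the standard library and two hypothetical snippets (`RAW_HTML` and `RENDERED_HTML`, both invented here, as is the `count_tag_spans` helper). The `span.tag` selector is the one used in the Selenium answer; the real page markup may differ.

```python
from html.parser import HTMLParser


class TagSpanCounter(HTMLParser):
    """Count <span> elements whose class list includes "tag"."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes
        classes = (dict(attrs).get("class") or "").split()
        if tag == "span" and "tag" in classes:
            self.count += 1


def count_tag_spans(html):
    """Return how many <span class="tag"> elements appear in the markup."""
    counter = TagSpanCounter()
    counter.feed(html)
    counter.close()
    return counter.count


# Hypothetical: what a plain HTTP client might receive before any script has run.
RAW_HTML = '<div id="tags"></div><script>loadTags();</script>'

# Hypothetical: the same region after a browser has executed the script.
RENDERED_HTML = (
    '<div id="tags"><span class="tag">fantasy</span>'
    '<span class="tag">magic</span></div>'
)

print(count_tag_spans(RAW_HTML))       # 0
print(count_tag_spans(RENDERED_HTML))  # 2
```

If the count is zero for the HTML your scraper downloads but the tags are visible in a browser, the data is being added by a script, and either a browser-driven tool or the underlying AJAX request is needed.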

Upvotes: 0
