Scott MacDonald

Reputation: 115

LXML XPATH - Data returned from one site and not another

I'm just learning Python and decided to play with some website scraping.

I wrote one script that works and a second, almost identical as far as I can tell, that doesn't, and I can't figure out why.

from lxml import html
import requests

page = requests.get('https://thronesdb.com/set/Core')
tree = html.fromstring(page.content)

cards = [tree.xpath('//a[@class = "card-tip"]/text()'),tree.xpath('//td[@data-th = "Faction"]/text()'),
              tree.xpath('//td[@data-th = "Cost"]/text()'),tree.xpath('//td[@data-th = "Type"]/text()'),
              tree.xpath('//td[@data-th = "STR"]/text()'),tree.xpath('//td[@data-th = "Traits"]/text()'),
              tree.xpath('//td[@data-th = "Set"]/text()'),tree.xpath('//a[@class = "card-tip"]/@data-code')]

print(cards)

That one does what I expect (I know it's not pretty). It grabs certain elements from a table on the site.
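One possible way to tidy this up is to walk the table row by row instead of building one list per column. With per-column lists, a cell that has no text contributes no `text()` node, so the lists can end up with different lengths and stop lining up. The sketch below runs on made-up sample markup; the real page structure, the `extract_rows` helper, and the sample values are assumptions for illustration only.

```python
from lxml import html

# Made-up sample markup; the real table layout is an assumption.
SAMPLE = """
<table>
  <tr>
    <td data-th="Name"><a class="card-tip" data-code="01001">Card A</a></td>
    <td data-th="Faction">Stark</td>
    <td data-th="Cost">2</td>
  </tr>
  <tr>
    <td data-th="Name"><a class="card-tip" data-code="01002">Card B</a></td>
    <td data-th="Faction">Lannister</td>
    <td data-th="Cost"></td>
  </tr>
</table>
"""


def extract_rows(tree):
    """Return one dict per table row that contains a card link."""
    rows = []
    for tr in tree.xpath('//tr[.//a[@class="card-tip"]]'):
        rows.append({
            # string(...) yields '' for an empty cell instead of skipping it
            "name": tr.xpath('string(.//a[@class="card-tip"])').strip(),
            "code": tr.xpath('string(.//a[@class="card-tip"]/@data-code)'),
            "faction": tr.xpath('string(td[@data-th="Faction"])').strip(),
            "cost": tr.xpath('string(td[@data-th="Cost"])').strip(),
        })
    return rows


tree = html.fromstring(SAMPLE)
rows = extract_rows(tree)

# Column-wise extraction silently drops the empty cell.
cost_column = tree.xpath('//td[@data-th="Cost"]/text()')

print(rows)
print(cost_column)
```

Each row keeps its own values together, so an empty cell shows up as an empty string rather than shifting every later entry in that column.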

This one returns [[]]:

from lxml import html
import requests

page = requests.get('http://www.redflagdeals.com/search/#!/q=baby%20monitor')
tree = html.fromstring(page.content)

offers = [tree.xpath('//a[@class = "offer_title"]/text()')]

print(offers)

What I expect it to do is give me a list that has the text from each offer_title element on the page.

The XPath I'm targeting, which I grabbed from Firebug, is:

/html/body/div[1]/div/div/div/section/div[2]/ul[1]/li[2]/div/h3/a

Here's the actual string from the site:

<a href="/deal/other-kids-babies/angelcare-digital-video-and-sound-monitor-8999-9000-off-9724/" class="offer_title">Angelcare Digital Video And Sound Monitor - $89.99 ($90.00 Off)</a>

I have also read a few other questions, but they didn't explain how this could work the first way but not the second. I can't link to them because of the link restrictions on new accounts.

Any help would be appreciated. I did some reading on the lxml website about xpath, but I may be missing something in the way I'm building a query.

Thanks!

Upvotes: 1

Views: 121

Answers (1)

Andersson

Reputation: 52685

The first script works because the data it needs is already present in the HTML that the server returns. On the second page, the data is generated dynamically by JavaScript after the page loads. requests only downloads the raw HTML and does not execute JavaScript, so the offer_title links are not in page.content and the XPath matches nothing.
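One hint that the second page is built client-side is the URL itself: the search term sits after the `#`, in the fragment, and a URL fragment is not sent to the server as part of an HTTP request. A small sketch using only the standard library to split the URL from the question:

```python
from urllib.parse import urlsplit

url = 'http://www.redflagdeals.com/search/#!/q=baby%20monitor'
parts = urlsplit(url)

# Only the scheme, host, path and query are used to build the HTTP request.
print(parts.path)      # '/search/' : the path the server is asked for
print(parts.query)     # '' : the search term is not in the query string
print(parts.fragment)  # '!/q=baby%20monitor' : left for client-side JavaScript
```

So the server only ever sees a request for `/search/`, and it is presumably the page's JavaScript that reads the fragment and loads the matching offers.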

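You can use, for example, Selenium + PhantomJS to render the page and then read the required data:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait as wait

# PhantomJS requires Selenium 3.x; with Selenium 4, use a headless
# browser driver such as webdriver.Chrome instead.
driver = webdriver.PhantomJS(executable_path='/path/to/phantomJS')
driver.get('http://www.redflagdeals.com/search/#!/q=baby%20monitor')
xpath = '//a[@class = "offer_title"]'
# Wait up to 10 seconds for JavaScript to render the first offer link
wait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, xpath)))
# find_elements(By.XPATH, ...) replaces find_elements_by_xpath,
# which newer Selenium releases no longer provide
offers = [link.get_attribute('textContent') for link in driver.find_elements(By.XPATH, xpath)]
driver.quit()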
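
Once a browser has rendered the page, the original lxml query can also be reused on the rendered markup. A minimal sketch, assuming the rendered HTML is available as a string (in Selenium, `driver.page_source`). The `offer_titles` helper and both sample strings are hypothetical stand-ins; only the link itself is copied from the question.

```python
from lxml import html

# Stand-in for what requests receives: a shell page without any offers.
RAW_HTML = '<html><body><div id="results"></div></body></html>'

# Stand-in for driver.page_source after JavaScript has run.
RENDERED_HTML = (
    '<html><body><ul><li><div><h3>'
    '<a href="/deal/other-kids-babies/angelcare-digital-video-and-sound-'
    'monitor-8999-9000-off-9724/" class="offer_title">'
    'Angelcare Digital Video And Sound Monitor - $89.99 ($90.00 Off)'
    '</a></h3></div></li></ul></body></html>'
)


def offer_titles(markup):
    """Return the text of every offer_title link in the given HTML string."""
    tree = html.fromstring(markup)
    return tree.xpath('//a[@class = "offer_title"]/text()')


raw_titles = offer_titles(RAW_HTML)
rendered_titles = offer_titles(RENDERED_HTML)

print(raw_titles)
print(rendered_titles)
```

The same XPath returns an empty list for the unrendered shell and the offer title for the rendered markup, which matches the behaviour described in the question.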

Upvotes: 0
