Reputation: 115
I'm just learning Python and decided to play with some website scraping.
I wrote one script that works and a second, almost identical as far as I can tell, that doesn't, and I can't figure out why.
from lxml import html
import requests
page = requests.get('https://thronesdb.com/set/Core')
tree = html.fromstring(page.content)
cards = [tree.xpath('//a[@class = "card-tip"]/text()'),tree.xpath('//td[@data-th = "Faction"]/text()'),
tree.xpath('//td[@data-th = "Cost"]/text()'),tree.xpath('//td[@data-th = "Type"]/text()'),
tree.xpath('//td[@data-th = "STR"]/text()'),tree.xpath('//td[@data-th = "Traits"]/text()'),
tree.xpath('//td[@data-th = "Set"]/text()'),tree.xpath('//a[@class = "card-tip"]/@data-code')]
print(cards)
That one does what I expect (I know it's not pretty). It grabs certain elements from a table on the site.
This one returns [[]]:
from lxml import html
import requests
page = requests.get('http://www.redflagdeals.com/search/#!/q=baby%20monitor')
tree = html.fromstring(page.content)
offers = [tree.xpath('//a[@class = "offer_title"]/text()')]
print(offers)
What I expect it to do is give me a list that has the text from each offer_title element on the page.
The XPath I'm aiming for, which I grabbed from Firebug, is:
/html/body/div[1]/div/div/div/section/div[2]/ul[1]/li[2]/div/h3/a
Here's the actual string from the site:
<a href="/deal/other-kids-babies/angelcare-digital-video-and-sound-monitor-8999-9000-off-9724/" class="offer_title">Angelcare Digital Video And Sound Monitor - $89.99 ($90.00 Off)</a>
I have also read a few other questions, but they didn't answer how this could work the first way, but not the second. Can't post them because of the link restrictions on new accounts. Titles:
Any help would be appreciated. I did some reading on the lxml website about xpath, but I may be missing something in the way I'm building a query.
Thanks!
Upvotes: 1
Views: 121
Reputation: 52685
The first script works because the required data is already present in the HTML that the server returns. On the second page the required data is generated dynamically by JavaScript after the page loads, so you cannot scrape it with requests alone: requests only downloads the initial HTML and does not execute JavaScript.
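One way to see this for the URL in the question: everything after the `#` is a URL fragment, which stays on the client side and is not sent to the server, so the search terms only take effect once JavaScript runs in a browser. A small sketch using only the standard library (the variable names are just for illustration):

```python
from urllib.parse import urldefrag

url = 'http://www.redflagdeals.com/search/#!/q=baby%20monitor'

# Split the URL into the part that is requested from the server
# and the fragment that only the browser's JavaScript ever sees.
sent_to_server, fragment = urldefrag(url)

print(sent_to_server)  # http://www.redflagdeals.com/search/
print(fragment)        # !/q=baby%20monitor
```

So the HTML that requests downloads is the generic search page, without any of the "baby monitor" offers in it.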
You can try, for example, Selenium + PhantomJS to get the required data, as below:
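from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait as wait
# Note: webdriver.PhantomJS and find_elements_by_xpath exist in Selenium 3.x; both were removed in Selenium 4.
driver = webdriver.PhantomJS(executable_path='/path/to/phantomJS')
driver.get('http://www.redflagdeals.com/search/#!/q=baby%20monitor')
xpath = '//a[@class = "offer_title"]'
# Wait up to 10 seconds for JavaScript to render the first offer link.
wait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, xpath)))
# textContent returns each link's text even if the element is not visible.
offers = [link.get_attribute('textContent') for link in driver.find_elements_by_xpath(xpath)]
print(offers)
# Shut down the browser process when done.
driver.quit()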
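Before reaching for a browser, it can help to confirm that the elements really are missing from the HTML that requests receives. Below is a minimal sketch using only the standard-library html.parser. The helper names `LinkTextCollector` and `link_texts` and both sample strings are made up for illustration; they are not real responses from the site. In practice you would pass in `page.text` from the question's code.

```python
from html.parser import HTMLParser


class LinkTextCollector(HTMLParser):
    """Collect the text of <a> elements that carry a given class (hypothetical helper)."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.texts = []
        self._inside = False

    def handle_starttag(self, tag, attrs):
        # The class attribute may be absent or hold several space-separated names.
        classes = (dict(attrs).get('class') or '').split()
        if tag == 'a' and self.css_class in classes:
            self._inside = True
            self.texts.append('')

    def handle_data(self, data):
        if self._inside:
            self.texts[-1] += data

    def handle_endtag(self, tag):
        if tag == 'a':
            self._inside = False


def link_texts(html_text, css_class):
    """Return the text of every matching link found in html_text."""
    parser = LinkTextCollector(css_class)
    parser.feed(html_text)
    parser.close()
    return parser.texts


# Illustrative inputs only: what a browser might show after rendering,
# versus a bare page skeleton that JavaScript has not filled in yet.
rendered = '<ul><li><h3><a href="/deal/x/" class="offer_title">Some Offer</a></h3></li></ul>'
static = '<html><body><div id="app"></div><script src="app.js"></script></body></html>'

print(link_texts(rendered, 'offer_title'))
print(link_texts(static, 'offer_title'))
```

If the list comes back empty for `page.text` while the links are visible in the browser, the content is being added by JavaScript and a browser-driven approach like the one above is needed.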
Upvotes: 0