Reputation: 185
I am using PhantomJS with Selenium to scrape data from the link given in the snippet below. When I extract the data with element.text on a PhantomJS WebElement, some values come back blank, whereas with chromedriver I can scrape all of the data.
I can only use a headless browser because I am running this on an AWS Linux server.
How can I scrape all of the data with PhantomJS without missing any values? Thanks in advance.
The snippet is below:
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.common.exceptions import NoSuchElementException

dcap = dict(DesiredCapabilities.PHANTOMJS)
dcap["phantomjs.page.settings.userAgent"] = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/53 "
    "(KHTML, like Gecko) Chrome/29.0.1547.57 Safari/537.36")
driver = webdriver.PhantomJS(desired_capabilities=dcap, service_args=['--ignore-ssl-errors=true', '--load-images=false'])
driver.get("http://www.myntra.com/Dresses/Casual-Collection/Casual-Collection-by-Debenhams-Purple-Floral-Print-Maxi-Dress/348207/buy")
driver.implicitly_wait(5)
try:
    driver.find_element_by_class_name("size-buttons-show-size-chart").click()
    driver.implicitly_wait(10)
    div_s = driver.find_elements_by_class_name("size-chart-cell")
    # div_s = driver.find_elements_by_xpath("""//*[@id="mountRoot"]/div/div/div/div[3]/div/div[2]/div[1]/table/tbody/tr""")
    size_data = ''
    for s in div_s:
        print str(s.text)
except NoSuchElementException:
    print "NoSuchElementException"
Modified output:
Size XS S M L XL XXL 3XL
Brand Size UK10 UK12 UK14 UK16 UK18 UK20 UK22
Hips (INCHES) 36 38 40 42.5 45.25 48 50.75
31 41.75 # most elements in this row are missing (could not be scraped)
Bust (INCHES) 34.25 36.25 38 40 43.75 46.5 49.25
Upvotes: 3
Views: 1399
Reputation: 185
I think I found the reason behind it.
Thanks for your reply, @alecxe. I found the explanation in the answer quoted below:
The textContent property is "inherited" from the Node interface of the DOM Core specification. The text property is "inherited" from the HTML5 HTMLAnchorElement interface and is specified as "must return the same value as the textContent IDL attribute".
The two are probably retained to converge different browser behaviour, the text property for script elements is defined slightly differently.
Note that the DOM specification is a general specification for any kind of document (e.g. HTML, XML, SGML, etc.) whereas HTML5 is specifically for HTML that leverages and extends the DOM Core in many respects (some might say it's a "super set" of a few DOM specs plus HTML plus …).
Note that "inherited" does not mean "prototype inheritance", just the more general meaning of inherited.
Again, thank you for this.
Difference between text and textContent properties
Upvotes: 0
Reputation: 474211
Interesting problem. Using textContent would actually work in this case:
for s in div_s:
    print(str(s.get_attribute("textContent")))
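If you would rather keep using .text where it works and only fall back to textContent for the cells that come back blank, a small helper along these lines might do. This is a minimal sketch: `cell_text` is a hypothetical name, not part of Selenium, and it assumes the blank cells still carry their text in the DOM, as the loop above suggests.

```python
def cell_text(element):
    """Return an element's text, using textContent when .text is blank."""
    text = element.text or ""
    if not text.strip():
        # Fall back to the DOM property when the driver reports no text.
        text = element.get_attribute("textContent") or ""
    # textContent keeps the markup's raw whitespace, so collapse it.
    return " ".join(text.split())
```

It would replace the body of the loop, for example `print(cell_text(s))` for each `s` in `div_s`.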
Differences between .text, textContent and other properties are perfectly described here:
Note that there is no point in calling implicitly_wait() multiple times. It does not act like time.sleep(), meaning it does not pause for the given amount of time immediately. Instead, it only instructs the driver to set the "implicit wait" to the specified number of seconds:
An implicit wait is to tell WebDriver to poll the DOM for a certain amount of time when trying to find an element or elements if they are not immediately available.
A better way to wait in this case would be to use Explicit Waits.
Upvotes: 1