Dinu Duke

Reputation: 185

Scraping web data using PhantomJS and Selenium

I am using PhantomJS with Selenium to scrape data from the link given in the snippet. When I extract the data with element.text on each PhantomJS web element, some values in the middle come back blank, whereas with chromedriver I can scrape all of the data.

I can only use a headless browser, since the script runs on an AWS Linux server.

How can I scrape all of the data with PhantomJS without missing any values? Thank you in advance.

The snippet is below:

from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.common.exceptions import NoSuchElementException

# Present a desktop browser user agent to the site
dcap = dict(DesiredCapabilities.PHANTOMJS)
dcap["phantomjs.page.settings.userAgent"] = (
     "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/53 "
     "(KHTML, like Gecko) Chrome/29.0.1547.57 Safari/537.36")
driver = webdriver.PhantomJS(
    desired_capabilities=dcap,
    service_args=['--ignore-ssl-errors=true', '--load-images=false'])
driver.get("http://www.myntra.com/Dresses/Casual-Collection/Casual-Collection-by-Debenhams-Purple-Floral-Print-Maxi-Dress/348207/buy")
driver.implicitly_wait(5)
try:
    # Open the size chart, then read every cell in it
    driver.find_element_by_class_name("size-buttons-show-size-chart").click()
    driver.implicitly_wait(10)
    div_s = driver.find_elements_by_class_name("size-chart-cell")
    # div_s = driver.find_elements_by_xpath("""//*[@id="mountRoot"]/div/div/div/div[3]/div/div[2]/div[1]/table/tbody/tr""")
    for s in div_s:
        print(s.text)  # some cells come back empty under PhantomJS
except NoSuchElementException:
    print("NoSuchElementException")

Modified output:

Size XS S M L XL XXL 3XL
Brand Size UK10 UK12 UK14 UK16 UK18 UK20 UK22
Hips (INCHES) 36 38 40 42.5 45.25 48 50.75
31 41.75 # most cells in this row are missing (not scraped)
Bust (INCHES) 34.25 36.25 38 40 43.75 46.5 49.25

The actual table is: Size Chart

Upvotes: 3

Views: 1399

Answers (2)

Dinu Duke

Reputation: 185

I think I found the answer and the reason behind it.

Thanks for your reply, @alecxe. I found my answer here:

The textContent property is "inherited" from the Node interface of the DOM Core specification. The text property is "inherited" from the HTML5 HTMLAnchorElement interface and is specified as "must return the same value as the textContent IDL attribute".

The two are probably retained to converge different browser behaviour, the text property for script elements is defined slightly differently.

Note that the DOM specification is a general specification for any kind of document (e.g. HTML, XML, SGML, etc.) whereas HTML5 is specifically for HTML that leverages and extends the DOM Core in many respects (some might say it's a "super set" of a few DOM specs plus HTML plus …).

Note that "inherited" does not mean "prototype inheritance", just the more general meaning of inherited

Again, thank you for this.

Difference between text and textContent properties

Upvotes: 0

alecxe

Reputation: 474211

Interesting problem. Using textContent instead of .text would actually work in this case:

for s in div_s:
    print(str(s.get_attribute("textContent")))
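
One practical difference is that textContent hands back the node's raw text, so it can include line breaks and indentation from the markup that .text would normally collapse. The following is a minimal sketch of cleaning that up; cell_texts is a hypothetical helper name, not part of Selenium, and it only assumes each element offers get_attribute("textContent") as in the loop above.

```python
def cell_texts(elements):
    """Return whitespace-normalized textContent for each element.

    `elements` is any iterable of objects with a get_attribute() method,
    such as the list returned by find_elements_by_class_name().
    """
    texts = []
    for element in elements:
        # get_attribute() may return None if the attribute is unavailable
        raw = element.get_attribute("textContent") or ""
        # Collapse runs of spaces, tabs, and newlines into single spaces
        texts.append(" ".join(raw.split()))
    return texts
```

With the question's code this could be used as `for text in cell_texts(div_s): print(text)`.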

Differences between .text, textContent and other properties are perfectly described here:

Note that there is no point in calling implicitly_wait() multiple times. It does not act like time.sleep(): it does not pause for the given amount of time on the spot. It only sets the driver's "implicit wait" to the specified number of seconds:

An implicit wait is to tell WebDriver to poll the DOM for a certain amount of time when trying to find an element or elements if they are not immediately available.

A better way to wait in this case would be to use Explicit Waits.
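
To make the idea concrete, here is a hand-rolled sketch of the polling that an explicit wait performs: keep checking a condition until it holds or a timeout expires. wait_until is a hypothetical name used for illustration only, and the sketch assumes Python 3; Selenium ships its own helper for this, WebDriverWait, which is what you would normally use.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds
    have passed without the condition being met.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(
                "condition not met within %s seconds" % timeout)
        # Pause briefly before checking again
        time.sleep(poll_interval)
```

Applied to the question, the call could look like `wait_until(lambda: driver.find_elements_by_class_name("size-chart-cell"))`, which returns as soon as at least one cell is found. With Selenium's own classes the rough equivalent is `WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located((By.CLASS_NAME, "size-chart-cell")))`.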

Upvotes: 1
