Reputation: 308
I'm trying to scrape the ads from Ask, which are generated in an iframe by JavaScript hosted by Google.
When I manually navigate my way through, and view source, there they are (I'm specifically looking for a div with the id "adBlock", which is in an iframe).
But when I try using Firefox, Chromedriver or FirefoxPortable, the source returned to me is missing all of the elements I'm looking for.
I tried scraping with urllib2 and had the same results, even when adding in the necessary headers. I thought for sure that a real browser instance, like the one WebDriver creates, would have fixed that problem.
Here's the code I'm working off of, which had to be cobbled together from a few different sources:
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pprint
# Create a new instance of the Chrome driver
driver = webdriver.Chrome(r'C:\Python27\Chromedriver\chromedriver.exe')
driver.get("http://www.ask.com")
print driver.title
inputElement = driver.find_element_by_name("q")
# type in the search
inputElement.send_keys("baseball hats")
# submit the form
inputElement.submit()
try:
    WebDriverWait(driver, 10).until(EC.title_contains("baseball"))
    print driver.title
    output = driver.page_source
    print(output)
finally:
    driver.quit()
I know I cycle through a few different attempts at viewing the source; that's not what I'm concerned about.
Any thoughts as to why I'm getting one result from this script (ads omitted) and a totally different result (ads present) from the browser it opened in? I've tried Scrapy, Selenium, Urllib2, etc. No joy.
Upvotes: 2
Views: 2612
Reputation: 9029
Selenium's page_source only returns the contents of the current frame, so the document loaded inside an iframe is not included. You'll have to switch into each iframe, using something along these lines:
iframes = driver.find_elements_by_tag_name("iframe")
for iframe in iframes:
    # go back to the top-level document before entering the next iframe
    driver.switch_to_default_content()
    driver.switch_to_frame(iframe)
    output = driver.page_source
    print(output)
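If the ad block turns out to sit in an iframe nested inside another iframe, a single pass over the top-level frames won't reach it. Here is a rough sketch of a recursive version, assuming a newer Selenium release where `driver.switch_to.frame(...)`, `driver.switch_to.parent_frame()` and `driver.find_elements(by, value)` are available. The helper name `frame_sources` and the `max_depth` parameter are made up for illustration, not part of Selenium.

```python
def frame_sources(driver, max_depth=2, _depth=0):
    """Collect page_source for the current frame and its nested iframes.

    Hypothetical helper: `driver` is expected to behave like a Selenium
    WebDriver. Frames are visited depth-first, and the driver is returned
    to the frame it started in before this function exits.
    """
    sources = [driver.page_source]
    if _depth >= max_depth:
        return sources
    # "tag name" is the locator string behind Selenium's By.TAG_NAME
    for frame in driver.find_elements("tag name", "iframe"):
        driver.switch_to.frame(frame)
        try:
            sources.extend(frame_sources(driver, max_depth, _depth + 1))
        finally:
            # step back out so the next sibling iframe can be entered
            driver.switch_to.parent_frame()
    return sources
```

With a live driver you would call `frame_sources(driver)` once the results page has loaded, then check each returned string for `adBlock`. Since the ads are injected by JavaScript, you may still need to wait for them before collecting the sources.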
Upvotes: 3