Sang Huynh

Reputation: 241

Scrapy/Splash Click on a button then get content from new page in new window

I'm facing a problem: when I click a button, JavaScript handles the action and then redirects to a new page in a new window (similar to clicking an <a> with target="_blank"). With Scrapy/Splash, I don't know how to get the content of the new page (that is, I don't know how to control that new page).

Can anyone help?

script = """
    function main(splash)
        assert(splash:go(splash.args.url))
        splash:wait(0.5)
        local element = splash:select('div.result-content-columns div.result-title')
        local bounds = element:bounds()
        element:mouse_click{x=bounds.width/2, y=bounds.height/2}
        return splash:html()
    end
"""

def start_requests(self):
    for url in self.start_urls:
        yield SplashRequest(url, self.parse, endpoint='execute', args={'lua_source': self.script})

Upvotes: 1

Views: 5041

Answers (1)

Daniel Scott

Reputation: 985

Issue:

The problem is that you can't scrape HTML that is outside your selection scope. When clicking a link opens new content, especially if an iframe is involved, that content is rarely brought into scope for scraping.

Solution:

Choose a method of selecting the new iframe, and then proceed to parse the new html.

The Scrapy-Splash method

(This is an adaptation of Mikhail Korobov's solution from this answer)

If you can get the src URL of the new page that pops up, requesting that URL directly may be the most reliable approach. Otherwise, you can try selecting the iframe this way:

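import parsel
from scrapy_splash import SplashRequest

# ...
    yield SplashRequest(url, self.parse_result, endpoint='render.json',
                        args={'html': 1, 'iframes': 1})

def parse_result(self, response):
    # response.data is the decoded render.json output;
    # 'childFrames' holds one entry per iframe found on the page
    iframe_html = response.data['childFrames'][0]['html']
    sel = parsel.Selector(iframe_html)
    item = {
        'my_field': sel.xpath(...),
        # ...
    }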
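
The `[0]` index above assumes the content sits in the first top-level iframe. As a hedged sketch, assuming the render.json output nests frames under `childFrames` (as in the snippet above), a small helper can walk every nested frame instead. `iter_frames` and the sample dictionary are illustrative names and data, not part of Splash:

```python
def iter_frames(frame):
    """Yield every descendant frame of a render.json-style dict, depth first."""
    for child in frame.get('childFrames', []):
        yield child
        yield from iter_frames(child)


# Hypothetical data shaped like response.data when 'html': 1 and 'iframes': 1 are set
sample = {
    'html': '<html>top</html>',
    'childFrames': [
        {'html': '<html>first</html>', 'childFrames': [
            {'html': '<html>nested</html>', 'childFrames': []},
        ]},
        {'html': '<html>second</html>', 'childFrames': []},
    ],
}

# Collect the HTML of every frame below the top-level page
frame_html = [frame['html'] for frame in iter_frames(sample)]
```

Inside `parse_result`, `iter_frames(response.data)` could stand in for the hard-coded index, with each frame's HTML passed to `parsel.Selector` in turn.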

The Selenium method

(Requires pip install selenium bs4 lxml, and possibly a ChromeDriver download for your OS from here: Selenium Chromedrivers.) Supports JavaScript parsing! Woohoo!

The following code switches scope to the new frame:

# Goes at the top
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
import time

# Your path depends on where you downloaded/located your chromedriver.exe
# Raw string so the backslashes are not treated as escape sequences
CHROME_PATH = r'C:\Program Files (x86)\Google\Chrome\Application\chrome.exe'
CHROMEDRIVER_PATH = 'chromedriver.exe'
WINDOW_SIZE = "1920,1080"

chrome_options = Options()
chrome_options.add_argument("--log-level=3")
chrome_options.add_argument("--headless") # Speeds things up if you don't need gui
chrome_options.add_argument("--window-size=%s" % WINDOW_SIZE)

chrome_options.binary_location = CHROME_PATH

# Selenium 4 style: the driver path goes through Service, the options through `options=`
browser = webdriver.Chrome(service=Service(CHROMEDRIVER_PATH), options=chrome_options)

url = "http://example_js_site.com" # Your site goes here (the scheme is required)
browser.get(url)
time.sleep(3) # An unsophisticated way to wait for the new page to load.
browser.switch_to.frame(0)

soup = BeautifulSoup(browser.page_source.encode('utf-8').strip(), 'lxml')

# This will return any content found in tags called '<table>'
table = soup.find_all('table') 
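
If the click opens a separate browser window or tab rather than an iframe (the case described in the question), Selenium exposes the open windows through `browser.window_handles` and `browser.switch_to.window(...)`. A minimal sketch, assuming the click opens exactly one new window; `find_new_window` and `click_and_switch` are hypothetical helper names, and `click` is any callable that triggers the button:

```python
def find_new_window(handles_before, handles_after):
    """Return the first handle that appeared after the click, or None."""
    known = set(handles_before)
    for handle in handles_after:
        if handle not in known:
            return handle
    return None


def click_and_switch(browser, click):
    """Run click(), then switch the driver to the window it opened, if any."""
    # Copy the handles so later changes cannot affect the "before" snapshot
    before = list(browser.window_handles)
    click()
    new_handle = find_new_window(before, browser.window_handles)
    if new_handle is not None:
        browser.switch_to.window(new_handle)
    return new_handle
```

After a successful switch, `browser.page_source` refers to the new window, so the BeautifulSoup parsing above applies unchanged. The new window may not exist immediately after the click, so a short wait before reading the handles may be needed.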

My favorite of the two options is Selenium, but try the first solution if you are more comfortable with it!

Upvotes: 2
