Reputation: 241
I'm facing a problem: when I click a button, JavaScript handles the action and then redirects to a new page in a new window (similar to clicking an <a> with target="_blank"). In Scrapy/Splash I don't know how to get the content of that new page (i.e. I don't know how to control the new window).
Can anyone help?
script = """
function main(splash)
    assert(splash:go(splash.args.url))
    splash:wait(0.5)
    local element = splash:select('div.result-content-columns div.result-title')
    local bounds = element:bounds()
    element:mouse_click{x=bounds.width/2, y=bounds.height/2}
    return splash:html()
end
"""

def start_requests(self):
    for url in self.start_urls:
        yield SplashRequest(url, self.parse, endpoint='execute',
                            args={'lua_source': self.script})
Upvotes: 1
Views: 5041
Reputation: 985
The problem is that you can't scrape HTML that is outside your selection scope. When the new link is clicked and an iframe is involved, the iframe rarely comes into scope for scraping on its own.
Choose a method of selecting the new iframe, then parse its HTML.
(This is an adaptation of Mikhail Korobov's solution from this answer.)
If you can get the src link of the page that pops up, following it directly may be the most reliable approach; however, you can also try selecting the iframe this way:
import parsel

# ...
    yield SplashRequest(url, self.parse_result, endpoint='render.json',
                        args={'html': 1, 'iframes': 1})

def parse_result(self, response):
    # render.json with iframes=1 also returns the HTML of every child frame
    iframe_html = response.data['childFrames'][0]['html']
    sel = parsel.Selector(iframe_html)
    item = {
        'my_field': sel.xpath(...),
        # ...
    }
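If you'd rather follow the iframe's src directly (the other route mentioned above), a minimal sketch could look like this, assuming a standard scrapy-splash setup; the CSS selector, the XPath, and the parse_iframe callback name are placeholders adapted from the question's markup that you would adjust to your page:
def parse_result(self, response):
    # Alternative: grab the iframe's src from the main document and request it separately
    iframe_src = response.css('iframe::attr(src)').get()
    if iframe_src:
        yield SplashRequest(response.urljoin(iframe_src), self.parse_iframe,
                            args={'wait': 0.5})

def parse_iframe(self, response):
    # The iframe's page now arrives as an ordinary response
    yield {'my_field': response.xpath('//div[@class="result-title"]/text()').get()}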
The second option uses Selenium (requires pip install selenium bs4, and possibly a ChromeDriver download for your OS from Selenium Chromedrivers). It supports JavaScript rendering!
The following code switches scope to the new frame:
# Goes at the top
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import time

# Your paths depend on where you installed Chrome and located chromedriver.exe
CHROME_PATH = r'C:\Program Files (x86)\Google\Chrome\Application\chrome.exe'
CHROMEDRIVER_PATH = 'chromedriver.exe'
WINDOW_SIZE = "1920,1080"

chrome_options = Options()
chrome_options.add_argument("--log-level=3")
chrome_options.add_argument("--headless")  # Speeds things up if you don't need a GUI
chrome_options.add_argument("--window-size=%s" % WINDOW_SIZE)
chrome_options.binary_location = CHROME_PATH

browser = webdriver.Chrome(executable_path=CHROMEDRIVER_PATH, chrome_options=chrome_options)

url = "example_js_site.com"  # Your site goes here
browser.get(url)
time.sleep(3)  # An unsophisticated way to wait for the new page to load

# Switch scope into the first iframe on the page
browser.switch_to.frame(0)

soup = BeautifulSoup(browser.page_source.encode('utf-8').strip(), 'lxml')

# This will return any content found in <table> tags
table = soup.find_all('table')
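If the fixed time.sleep feels fragile, an explicit wait is a common alternative. A minimal sketch, assuming the frame you need is the first <iframe> on the page:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for the iframe to exist, then switch into it in one step
WebDriverWait(browser, 10).until(
    EC.frame_to_be_available_and_switch_to_it((By.TAG_NAME, "iframe"))
)
soup = BeautifulSoup(browser.page_source, 'lxml')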
My favorite of the two options is Selenium, but try the first solution if you are more comfortable with it!
Upvotes: 2