I am using Scrapy with Selenium to scrape content from this page: https://nikmikk.itch.io/door-knocker
On that page there is a table under the div with class .game_info_panel_widget, where the first row, Published 62 days ago, seems to be loaded dynamically.
I tried fetching the page as Scrapy sees it, but I cannot find that row in the HTML.
scrapy fetch --nolog https://nikmikk.itch.io/door-knocker > test.html
Here is what I see in test.html: the first table row is Status, not Published as it is when I view the page source directly in Chrome.
<div class="game_info_panel_widget">
  <table>
    <tbody>
      <tr>
        <td>Status</td>
        <td>Prototype</td>
        ...
      </tr>
      ...
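One way to compare what Scrapy received with what the browser shows is to list the first cell of every table row in each saved file. The following is only a sketch using the standard library; `RowLabelParser` and `row_labels` are hypothetical names, and it looks at every table in the page, not just the one inside .game_info_panel_widget.

```python
from html.parser import HTMLParser


class RowLabelParser(HTMLParser):
    """Collect the text of the first <td> in every <tr>."""

    def __init__(self):
        super().__init__()
        self.labels = []
        self._cell_index = 0          # number of <td> seen in the current row
        self._in_first_cell = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._cell_index = 0
        elif tag == "td":
            self._cell_index += 1
            self._in_first_cell = self._cell_index == 1
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "td" and self._in_first_cell:
            self.labels.append("".join(self._buffer).strip())
            self._in_first_cell = False

    def handle_data(self, data):
        if self._in_first_cell:
            self._buffer.append(data)


def row_labels(html):
    """Return the first-cell text of each table row found in an HTML string."""
    parser = RowLabelParser()
    parser.feed(html)
    return parser.labels
```

Calling `row_labels(open("test.html", encoding="utf-8").read())` on the Scrapy output and on a copy saved from the browser would show whether Published is missing from the raw response or only from the rendered page.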
In my class SpiderDownloaderMiddleware, I have included Selenium:
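from scrapy.http import HtmlResponse
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

options = webdriver.ChromeOptions()
options.add_argument('headless')
options.add_argument('window-size=1200x600')
# options= replaces the deprecated chrome_options= keyword
driver = webdriver.Chrome(options=options)

class SpiderDownloaderMiddleware(object):
    # Other code omitted

    def process_request(self, request, spider):
        driver.get(request.url)
        # Wait until the info panel container is present in the DOM
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, ".game_info_panel_widget"))
        )
        body = driver.page_source
        return HtmlResponse(driver.current_url, body=body, encoding='utf-8-sig', request=request)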
How can I check how that row is loaded, and how can I scrape that information?
Updated: I followed @Yosuva A's answer below and got something like this:
9 days ago
In development
Platforms
Windows
Rating
(9)
Author
David Clark
Genre
Survival, Puzzle
Tags
3D, Creepy, First-Person, Horror, Psychological Horror, Short, Singleplayer, Spooky, Unity
Average session
A few seconds
Languages
English
But the output is inconsistent: sometimes it includes every row, sometimes it doesn't. I guess this is because Selenium waits for a generic td element, which matches any cell:
"//div[@class='game_info_panel_widget']//table//tr//td"
I tried changing the wait to use td[@text='Published'], but then Selenium times out.
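A likely reason for that timeout: in XPath, `@text` selects an attribute named text, not the text inside the element, so the predicate never matches if the cells carry no such attribute. The sketch below shows the difference on a made-up, simplified copy of the table (the `SAMPLE` markup is an assumption for illustration), using the limited XPath support in Python's standard library rather than Selenium.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified copy of the info table, for illustration only.
SAMPLE = """
<div class="game_info_panel_widget">
  <table>
    <tr><td>Published</td><td>62 days ago</td></tr>
    <tr><td>Status</td><td>Prototype</td></tr>
  </table>
</div>
"""

root = ET.fromstring(SAMPLE)

# [@text='Published'] looks for an attribute called "text"; no <td> has one.
by_attribute = root.findall(".//td[@text='Published']")

# [.='Published'] compares the element's text content instead.
by_text = root.findall(".//td[.='Published']")

print(len(by_attribute))  # 0
print(len(by_text))       # 1
```

In Selenium, which evaluates full XPath 1.0 in the browser, the corresponding expression would be something like //div[@class='game_info_panel_widget']//td[text()='Published'], or td[normalize-space()='Published'] if the cell may contain surrounding whitespace.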
My code:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Selenium 3 style: the optional argument is the chromedriver path; if omitted, PATH is searched.
driver = webdriver.Chrome('chromedriver')
driver.implicitly_wait(15)
driver.get("https://thehive.itch.io/promnesia")
# Open the "more information" panel that contains the table
driver.find_element(By.XPATH, "//a[@class='toggle_info_btn']").click()
# Wait until at least one table cell is present
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, "//div[@class='game_info_panel_widget']//table//tr//td")))
cells = driver.find_elements(By.XPATH, "//div[@class='game_info_panel_widget']//table//tr//td")
for cell in cells:
    print(cell.text)
driver.quit()
Is there any other way?
Conclusion:
It works if we call time.sleep(2) after click(), as suggested by Yosuva A.
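A fixed sleep works, but it either wastes time or is too short on a slow connection. A possible alternative, sketched here under the assumption that the inconsistency comes from cells being present in the DOM before their text is visible, is a custom wait condition that only succeeds once every cell reports non-empty text. `cells_are_filled` and `read_info_table` are hypothetical helper names; a cell that is legitimately empty would make this wait time out.

```python
CELLS_XPATH = "//div[@class='game_info_panel_widget']//table//tr//td"


def cells_are_filled(driver):
    """Wait condition: return all cell texts once none is empty, else False."""
    # "xpath" is the string value of selenium's By.XPATH
    cells = driver.find_elements("xpath", CELLS_XPATH)
    texts = [cell.text.strip() for cell in cells]
    return texts if texts and all(texts) else False


def read_info_table(driver, timeout=10):
    """Poll until the table cells have text, then return their texts."""
    # Imported here so the condition above can be exercised without Selenium.
    from selenium.webdriver.support.ui import WebDriverWait

    # until() returns whatever truthy value the condition produced
    return WebDriverWait(driver, timeout).until(cells_are_filled)
```

With this in place, `read_info_table(driver)` could be called right after click() instead of time.sleep(2); it returns as soon as the cells are filled and raises a TimeoutException otherwise.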
Upvotes: 0
Views: 237
Reputation: 1574
Please let me know whether this helps.
import time

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Selenium 3 style: the optional argument is the chromedriver path; if omitted, PATH is searched.
driver = webdriver.Chrome('/usr/local/bin/chromedriver')
driver.implicitly_wait(15)
driver.get("https://thehive.itch.io/promnesia")
driver.find_element(By.XPATH, "//a[@class='toggle_info_btn']").click()
# Give the info panel time to finish opening before reading the cells
time.sleep(2)
# Wait until at least one table cell is present
WebDriverWait(driver, 3).until(EC.presence_of_element_located((By.XPATH, "//div[@class='game_info_panel_widget']/table//tr//td")))
table_rows = driver.find_elements(By.XPATH, "//div[@class='game_info_panel_widget']/table//tr//td")
for rows in table_rows:
    print(rows.text)
driver.quit()
Updated
1 day ago
Published
9 days ago
Status
In development
Platforms
Windows
Rating
(9)
Author
David Clark
Genre
Survival, Puzzle
Tags
3D, Creepy, First-Person, Horror, Psychological Horror, Short, Singleplayer, Spooky, Unity
Average session
A few seconds
Languages
English
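The output above is a flat list that alternates label and value. If each row really holds exactly two cells, as the printed output suggests, the list can be folded into a dictionary. This is a small sketch; `pair_cells` is a hypothetical helper, not part of Selenium or Scrapy.

```python
def pair_cells(texts):
    """Turn a flat [label, value, label, value, ...] list into a dict."""
    if len(texts) % 2:
        raise ValueError("expected an even number of cells, got %d" % len(texts))
    # Even positions are labels, odd positions are the matching values
    return dict(zip(texts[0::2], texts[1::2]))
```

For example, `pair_cells([row.text for row in table_rows])` would allow a lookup such as `info["Published"]`, and the ValueError gives an early signal when some cells came back missing.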
Upvotes: 1