hydradon

Reputation: 1436

Scrapy with Selenium does not detect HTML element loaded dynamically

I am using Scrapy with Selenium to scrape content from this page: https://nikmikk.itch.io/door-knocker

On that page there is a table inside the div with class game_info_panel_widget, and its first row (Published 62 days ago) seems to be loaded dynamically.

I tried fetching the page as Scrapy sees it, but I cannot find that row in the HTML.

scrapy fetch --nolog https://nikmikk.itch.io/door-knocker > test.html

Here is what I see in test.html: the first table row is Status, not Published as it is when I view the page source directly in Chrome.

<div class="game_info_panel_widget">
    <table>
        <tbody>
           <tr>
               <td>Status</td>
               <td>Prototype</td>
               ...

           </tr>
            ...
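One way to compare what Scrapy receives with what the browser shows is to list the td texts in the saved file and check whether the expected label is among them. The following is only a minimal sketch using the standard library; the CellCollector class, the td_texts helper, and the sample string (a trimmed stand-in for test.html, not the real page) are assumptions made for illustration.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the text of every <td> in the document, in order."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._in_td = False
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._in_td = True
            self._buf = []

    def handle_endtag(self, tag):
        if tag == "td" and self._in_td:
            self.cells.append("".join(self._buf).strip())
            self._in_td = False

    def handle_data(self, data):
        if self._in_td:
            self._buf.append(data)


def td_texts(html):
    """Return the stripped text of each <td> found in `html`."""
    parser = CellCollector()
    parser.feed(html)
    parser.close()
    return parser.cells


# Hypothetical, trimmed stand-in for the contents of test.html.
sample = """
<div class="game_info_panel_widget">
  <table><tbody>
    <tr><td>Status</td><td>Prototype</td></tr>
  </tbody></table>
</div>
"""

cells = td_texts(sample)
print(cells)                 # ['Status', 'Prototype']
print("Published" in cells)  # False
```

If the label is missing from the raw response but present in the browser, the row is most likely added by JavaScript after the initial load.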

In my class SpiderDownloaderMiddleware, I have included Selenium:

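from scrapy.http import HtmlResponse
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--window-size=1200,600')

# One browser instance shared by all requests
driver = webdriver.Chrome(options=options)

class SpiderDownloaderMiddleware(object):
    # Other methods omitted
    def process_request(self, request, spider):
        driver.get(request.url)

        # Wait until the info panel container is present in the DOM
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, ".game_info_panel_widget"))
        )

        body = driver.page_source
        return HtmlResponse(driver.current_url, body=body, encoding='utf-8-sig', request=request)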

How can I check how that row is loaded, and how can I scrape that information?

Updated: I followed @Yosuva A's answer below and got something like this:

 9 days ago

In development
Platforms
Windows
Rating
(9)
Author
David Clark
Genre
Survival, Puzzle
Tags
3D, Creepy, First-Person, Horror, Psychological Horror, Short, Singleplayer, Spooky, Unity
Average session
A few seconds
Languages
English

But the output is inconsistent: sometimes it gives the desired result, sometimes it doesn't. I guess this is because Selenium waits for a generic td element, which matches every cell in the table:

"//div[@class='game_info_panel_widget']//table//tr//td"

I tried changing it to td[@text='Published'], but Selenium times out.

My code:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome('chromedriver')  # Optional argument, if not specified will search path.
driver.implicitly_wait(15)

driver.get("https://thehive.itch.io/promnesia")
driver.find_element(By.XPATH, "//a[@class='toggle_info_btn']").click()

# Wait for any cell of the info table to be present
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, "//div[@class='game_info_panel_widget']//table//tr//td")))

table_rows = driver.find_elements(By.XPATH, "//div[@class='game_info_panel_widget']//table//tr//td")

for rows in table_rows:
    print(rows.text)

driver.quit()

Is there any other way?

Conclusion: It works if we call time.sleep(2) after click(), as suggested by Yosuva A.
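A possible alternative to the fixed sleep is to wait for the specific row. In XPath, @text refers to an attribute named text, which a td does not have, so td[@text='Published'] can never match; the cell's text is tested with text() or normalize-space() instead. Below is a rough sketch under that assumption. The helper names row_xpath and wait_for_row are hypothetical, and it is assumed (not verified against the page) that a cell with the visible label Published eventually appears after the click.

```python
def row_xpath(label):
    """Build an XPath for the <td> whose trimmed text equals `label`."""
    return (
        "//div[@class='game_info_panel_widget']"
        f"//td[normalize-space()='{label}']"
    )


def wait_for_row(driver, label, timeout=10):
    """Block until the labelled cell is visible, then return it."""
    # Imported lazily so row_xpath() can be used without Selenium installed.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    locator = (By.XPATH, row_xpath(label))
    # Raises selenium.common.exceptions.TimeoutException if the cell
    # does not become visible within `timeout` seconds.
    return WebDriverWait(driver, timeout).until(
        EC.visibility_of_element_located(locator)
    )


print(row_xpath("Published"))
```

With a driver already on the page, wait_for_row(driver, "Published") would replace the time.sleep(2) call after click(). Waiting for visibility rather than presence is a deliberate choice here, since Selenium returns empty text for elements that are in the DOM but not yet displayed.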

Upvotes: 0

Views: 237

Answers (1)

Yosuva Arulanthu

Reputation: 1574

Please let me know whether this helps.

import time

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome('/usr/local/bin/chromedriver')  # Optional argument, if not specified will search path.
driver.implicitly_wait(15)

driver.get("https://thehive.itch.io/promnesia")
# Open the "More information" panel
driver.find_element(By.XPATH, "//a[@class='toggle_info_btn']").click()
time.sleep(2)  # Give the panel time to finish rendering
# Wait for a cell of the info table to be present
WebDriverWait(driver, 3).until(EC.presence_of_element_located((By.XPATH, "//div[@class='game_info_panel_widget']/table//tr//td")))

table_rows = driver.find_elements(By.XPATH, "//div[@class='game_info_panel_widget']/table//tr//td")

# Each element is a single cell, so labels and values alternate
for rows in table_rows:
    print(rows.text)

driver.quit()

Output

Updated
1 day ago
Published
9 days ago
Status
In development
Platforms
Windows
Rating
(9)
Author
David Clark
Genre
Survival, Puzzle
Tags
3D, Creepy, First-Person, Horror, Psychological Horror, Short, Singleplayer, Spooky, Unity
Average session
A few seconds
Languages
English
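Because the cells come back as a flat list of alternating labels and values, they can be paired into a dictionary for easier lookup. This is only an illustrative sketch: pair_cells is a hypothetical helper, and the input list is copied from the first few lines of the output above.

```python
def pair_cells(cells):
    """Turn [label, value, label, value, ...] into {label: value}."""
    if len(cells) % 2:
        # An odd count means a label or value is missing, for example
        # when the table was read before it finished rendering.
        raise ValueError("expected an even number of cells")
    return dict(zip(cells[0::2], cells[1::2]))


cells = [
    "Updated", "1 day ago",
    "Published", "9 days ago",
    "Status", "In development",
]
info = pair_cells(cells)
print(info["Published"])  # 9 days ago
```

In the Selenium script, the input would be [cell.text for cell in table_rows]. Raising on an odd count makes a partially loaded table fail loudly instead of producing shifted pairs.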

Upvotes: 1
