Marreco

Reputation: 77

Python with Selenium scraper skips some content

I'm trying to scrape data from the website https://rsoe-edis.org/eventList and save it to an .xlsx file. The scraper doesn't raise any errors, but it skips some content: it saves all the links, but in some cases the other information is missing. Why?

from selenium import webdriver
import xlsxwriter
from datetime import datetime

now = (datetime.now()).strftime("%d-%m-%Y_%H-%M")

PATH = r"C:\Program Files (x86)\chromedriver.exe"
driver = webdriver.Chrome(PATH)

workbook = xlsxwriter.Workbook("RSOE_" + now + ".xlsx")

worksheet = workbook.add_worksheet("EventList") 

#Open the website
driver.get("https://rsoe-edis.org/eventList")

#Take events list
articles = driver.find_elements_by_tag_name("tr")
row = 0
col = 0

for article in articles:
        
        header = article.find_element_by_class_name("title")
        date = article.find_element_by_class_name("eventDate")
        location = article.find_element_by_class_name("location")
        link = article.find_element_by_tag_name("a")  
        worksheet.write(row, col,     header.text)
        worksheet.write(row, col + 1, date.text)
        worksheet.write(row, col + 2, location.text)
        worksheet.write(row, col + 3, link.get_attribute("href"))   

        print(header.text)

        row += 1      
workbook.close()      

driver.close()

Upvotes: 1

Views: 188

Answers (2)

tbjorch

Reputation: 1758

Problem explanation

The problem in your case is that many event cards are hidden (their style attribute is display:none;), and Selenium does not return the text of hidden elements through a WebElement's .text property; it returns an empty string instead.

Solution

To read the hidden elements you could, among other options:

  • fetch the WebElement's attribute values instead (e.g. .get_attribute("innerText"))
  • use raw JavaScript to unhide the elements and then continue with .text
  • use raw JavaScript to fetch all the WebElements

Example getting the element text content using .get_attribute()

Here I use the WebElement's .get_attribute() method to read the content through the innerText attribute, then the string .strip() method to remove leading and trailing whitespace.

driver.get("https://rsoe-edis.org/eventList")
articles = driver.find_elements_by_tag_name("tr")
with open("my_articles.csv", "wt") as f:
    for article in articles:
        header = article.find_element_by_class_name("title").get_attribute("innerText").strip()
        date = article.find_element_by_class_name("eventDate").get_attribute("innerText").strip()
        location = article.find_element_by_class_name("location").get_attribute("innerText").strip()
        link = article.find_element_by_tag_name("a").get_attribute("href")
        f.write(f"{header}, {date}, {location}, {link}\n")
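One caveat with writing rows through an f-string: an event title or location that contains a comma will be split into extra columns when the file is opened. Below is a minimal sketch using the standard-library csv module, which quotes such fields. The function name write_rows, the column names, and the sample row are made up for illustration; in the scraper the values would come from the get_attribute("innerText") calls above.

```python
import csv
import io


def write_rows(file_obj, rows):
    """Write (header, date, location, link) tuples as properly quoted CSV."""
    writer = csv.writer(file_obj)
    writer.writerow(["header", "date", "location", "link"])
    writer.writerows(rows)


# Made-up sample row; real values would come from the scraped elements.
sample = [("Flood", "2021-03-01", "Jakarta, Indonesia", "https://example.org/1")]

# An in-memory buffer stands in for the output file in this sketch.
buffer = io.StringIO()
write_rows(buffer, sample)
csv_text = buffer.getvalue()
```

When writing to a real file, the csv documentation recommends opening it with newline="", e.g. open("my_articles.csv", "w", newline="").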

Example unhiding elements with raw JavaScript enabling .text

Below is an example of the second alternative: it removes the style="display:none;" attribute from all the hidden cards, then continues with the WebElement's .text property to get the text content. The part you need from this example is the three lines below the comment # Loop through event list and unhide all event cards.

#Open the website
driver.get("https://rsoe-edis.org/eventList")

# Loop through event list and unhide all event cards
event_cards = driver.find_elements_by_class_name("event-card")
for card in event_cards:
    driver.execute_script("arguments[0].removeAttribute(\"style\")", card)

# Find all articles and add them to a file
articles = driver.find_elements_by_tag_name("tr")
with open("my_articles.csv", "wt") as f:
    for article in articles:
        header = article.find_element_by_class_name("title").text
        date = article.find_element_by_class_name("eventDate").text
        location = article.find_element_by_class_name("location").text
        link = article.find_element_by_tag_name("a").get_attribute("href")
        f.write(f"{header}, {date}, {location}, {link}\n")
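The third alternative has no example above, so here is a rough sketch of one possible reading of it: let a single JavaScript call collect the text of every row, hidden or not, and return plain data to Python. The names EXTRACT_ROWS_JS and fetch_rows are hypothetical, and the selectors simply reuse the class names from the question (title, eventDate, location); they are assumptions about the page, not something verified here.

```python
# JavaScript executed in the page; returns one object per table row.
EXTRACT_ROWS_JS = """
return Array.from(document.querySelectorAll('tr')).map(function (row) {
    function text(selector) {
        var node = row.querySelector(selector);
        return node ? node.textContent.trim() : '';
    }
    var anchor = row.querySelector('a');
    return {
        header: text('.title'),
        date: text('.eventDate'),
        location: text('.location'),
        link: anchor ? anchor.href : ''
    };
});
"""


def fetch_rows(driver):
    """Return a list of dicts for rows that have a non-empty header."""
    rows = driver.execute_script(EXTRACT_ROWS_JS)
    # Rows without a title cell (e.g. table headings) come back with an
    # empty header, so they are filtered out here.
    return [row for row in rows if row["header"]]


# Usage with a real driver (not executed in this sketch):
#     driver.get("https://rsoe-edis.org/eventList")
#     for row in fetch_rows(driver):
#         print(row["header"], row["link"])
```

Because textContent ignores CSS visibility, this approach does not need to unhide anything first, and it avoids one round trip to the browser per element.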

Upvotes: 1

PDHide

Reputation: 19949

from selenium import webdriver
import xlsxwriter
from datetime import datetime

now = (datetime.now()).strftime("%d-%m-%Y_%H-%M")

PATH = r"C:\Program Files (x86)\chromedriver.exe"
driver = webdriver.Chrome(PATH)

workbook = xlsxwriter.Workbook("RSOE_" + now + ".xlsx")

worksheet = workbook.add_worksheet("EventList") 

#Open the website
driver.get("https://rsoe-edis.org/eventList")

#Take events list
articles = driver.find_elements_by_tag_name("tr")
row = 0
col = 0

for article in articles:
        
        header = article.find_element_by_class_name("title")
        date = article.find_element_by_class_name("eventDate")
        location = article.find_element_by_class_name("location")
        link = article.find_element_by_tag_name("a")  
        worksheet.write(row, col,     header.get_attribute("textContent"))
        worksheet.write(row, col + 1, date.get_attribute("textContent"))
        worksheet.write(row, col + 2, location.get_attribute("textContent"))
        worksheet.write(row, col + 3, link.get_attribute("href"))   

        print(header.get_attribute("textContent"))

        row += 1      
workbook.close()      

driver.close()

.text only returns the text of elements that are visible; use get_attribute("textContent") instead.
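One thing to keep in mind: unlike .text, textContent returns the text exactly as it sits in the DOM, so it can include the indentation and line breaks of the page markup. A small sketch of a cleanup step follows; clean_text is a hypothetical helper name and the sample string is made up.

```python
def clean_text(raw):
    """Collapse runs of whitespace (including newlines) into single spaces."""
    # get_attribute() can return None when the attribute is missing,
    # so treat that case as an empty string.
    return " ".join((raw or "").split())


# Made-up value resembling what textContent may return for an indented cell.
example = clean_text("\n      Earthquake\n      in   Chile \n")
```

In the loop above, this would be applied as clean_text(header.get_attribute("textContent")) before passing the value to worksheet.write().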

Upvotes: 1
