Reputation: 77
I'm trying to scrape data from the website https://rsoe-edis.org/eventList and save it to an xlsx file. The scraper doesn't raise any error, but it skips some content: it saves all the links, yet in some rows the other columns come out empty. Why?
import xlsxwriter
from datetime import datetime
from selenium import webdriver

now = datetime.now().strftime("%d-%m-%Y_%H-%M")
PATH = r"C:\Program Files (x86)\chromedriver.exe"
driver = webdriver.Chrome(PATH)
workbook = xlsxwriter.Workbook("RSOE_" + now + ".xlsx")
worksheet = workbook.add_worksheet("EventList")
# Open the website
driver.get("https://rsoe-edis.org/eventList")
# Take the events list
articles = driver.find_elements_by_tag_name("tr")
row = 0
col = 0
for article in articles:
    header = article.find_element_by_class_name("title")
    date = article.find_element_by_class_name("eventDate")
    location = article.find_element_by_class_name("location")
    link = article.find_element_by_tag_name("a")
    worksheet.write(row, col, header.text)
    worksheet.write(row, col + 1, date.text)
    worksheet.write(row, col + 2, location.text)
    worksheet.write(row, col + 3, link.get_attribute("href"))
    print(header.text)
    row += 1
workbook.close()
driver.close()
Upvotes: 1
Views: 188
Reputation: 1758
The problem in your case is that many of the event cards are hidden (they carry a style="display:none;" attribute), and Selenium cannot read the text content of hidden elements through the webelement's .text attribute.
To interact with the hidden elements you could, among other options:
1. Use .get_attribute("innerText") instead of .text
2. Remove the style attribute that hides the cards, then keep using .text
Here I use the first option: the webelement's .get_attribute() method to read the content via the innerText attribute, then the string .strip() method to remove leading and trailing whitespace:
driver.get("https://rsoe-edis.org/eventList")
articles = driver.find_elements_by_tag_name("tr")
with open("my_articles.csv", "wt") as f:
    for article in articles:
        header = article.find_element_by_class_name("title").get_attribute("innerText").strip()
        date = article.find_element_by_class_name("eventDate").get_attribute("innerText").strip()
        location = article.find_element_by_class_name("location").get_attribute("innerText").strip()
        link = article.find_element_by_tag_name("a").get_attribute("href")
        f.write(f"{header}, {date}, {location}, {link}\n")
Below is an example where I use the second alternative: remove the style="display:none;" attribute from all the hidden cards, then continue with the webelements' .text attribute to get the text content. What you would need from this example are the three lines below the comment # Loop through event list and unhide all event cards
# Open the website
driver.get("https://rsoe-edis.org/eventList")
# Loop through event list and unhide all event cards
event_cards = driver.find_elements_by_class_name("event-card")
for card in event_cards:
    driver.execute_script("arguments[0].removeAttribute(\"style\")", card)
# Find all articles and add them to a file
articles = driver.find_elements_by_tag_name("tr")
with open("my_articles.csv", "wt") as f:
    for article in articles:
        header = article.find_element_by_class_name("title").text
        date = article.find_element_by_class_name("eventDate").text
        location = article.find_element_by_class_name("location").text
        link = article.find_element_by_tag_name("a").get_attribute("href")
        f.write(f"{header}, {date}, {location}, {link}\n")
Upvotes: 1
Reputation: 19949
import xlsxwriter
from datetime import datetime
from selenium import webdriver

now = datetime.now().strftime("%d-%m-%Y_%H-%M")
PATH = r"C:\Program Files (x86)\chromedriver.exe"
driver = webdriver.Chrome(PATH)
workbook = xlsxwriter.Workbook("RSOE_" + now + ".xlsx")
worksheet = workbook.add_worksheet("EventList")
# Open the website
driver.get("https://rsoe-edis.org/eventList")
# Take the events list
articles = driver.find_elements_by_tag_name("tr")
row = 0
col = 0
for article in articles:
    header = article.find_element_by_class_name("title")
    date = article.find_element_by_class_name("eventDate")
    location = article.find_element_by_class_name("location")
    link = article.find_element_by_tag_name("a")
    worksheet.write(row, col, header.get_attribute("textContent"))
    worksheet.write(row, col + 1, date.get_attribute("textContent"))
    worksheet.write(row, col + 2, location.get_attribute("textContent"))
    worksheet.write(row, col + 3, link.get_attribute("href"))
    print(header.get_attribute("textContent"))
    row += 1
workbook.close()
driver.close()
.text only retrieves the text of visible elements; use get_attribute("textContent") instead.
Upvotes: 1