Reputation: 191
I'm trying to scrape Merriam-Webster's Medical Dictionary for medical terms using Python and Chrome as the Selenium webdriver. So far, this is what I have:
from os import path
from selenium import webdriver

# Adding an ad-blocker to Chrome to speed up page load times
options = webdriver.ChromeOptions()
options.add_extension(path.abspath("ublock-origin.crx"))

# Declaring the Selenium webdriver
driver = webdriver.Chrome(chrome_options = options)

# Fetching the "A" terms as a test set
driver.get("https://www.merriam-webster.com/browse/medical/a")

scraped_words = [] # The list that will hold each word
page_num = 1
while page_num < 55: # There are 54 pages of "A" terms
    try:
        for i in range(4): # There are 3 columns per page of words
            column = "/html/body/div/div/div[5]/div[2]/div[1]/div/div[3]/ul/li[" + str(i) + "]/a"
            number_of_words = len(driver.find_elements_by_xpath(column))
            for j in range(number_of_words):
                word = driver.find_elements_by_xpath(column + "[" + str(j) + "]")
                scraped_words.append(word)
        driver.find_element_by_class_name("fa-angle-right").click() # Next page
        page_num += 1 # Increment page number to keep track of current page
    except:
        driver.close()

# Write out words to a file
with open("medical_terms.dict", "w") as text_file:
    for i in range(len(scraped_words)):
        text_file.write(str(scraped_words[i]))
        text_file.write("\n")

driver.close()
The above code fetches all the items, as the output of len(scraped_words) is the number expected. However, since I did not specify that I wanted the text of the elements, I get element identifiers (I think?) instead of text. If I instead use word = driver.find_elements_by_xpath(column + "[" + str(j) + "]").text to request the text of the element, I get the following error:
Traceback (most recent call last):
  File "mw_download.py", line 20, in <module>
    number_of_words = len(driver.find_elements_by_xpath(column))
  File "/usr/local/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 325, in find_elements_by_xpath
    return self.find_elements(by=By.XPATH, value=xpath)
  File "/usr/local/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 817, in find_elements
    'value': value})['value']
  File "/usr/local/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 256, in execute
    self.error_handler.check_response(response)
  File "/usr/local/lib/python3.6/site-packages/selenium/webdriver/remote/errorhandler.py", line 194, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.WebDriverException: Message: no such session
  (Driver info: chromedriver=2.31.488774 (7e15618d1bf16df8bf0ecf2914ed1964a387ba0b),platform=Mac OS X 10.12.6 x86_64)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "mw_download.py", line 27, in <module>
    driver.close()
  File "/usr/local/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 541, in close
    self.execute(Command.CLOSE)
  File "/usr/local/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 256, in execute
    self.error_handler.check_response(response)
  File "/usr/local/lib/python3.6/site-packages/selenium/webdriver/remote/errorhandler.py", line 194, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.WebDriverException: Message: no such session
  (Driver info: chromedriver=2.31.488774 (7e15618d1bf16df8bf0ecf2914ed1964a387ba0b),platform=Mac OS X 10.12.6 x86_64)
What is strange to me here is that the only code I change between runs is on line 22, yet the error message points to line 20 instead.
Any help in deciphering what's going on here and what I can do to fix it would be much appreciated! :+)
Upvotes: 4
Views: 1172
Reputation: 6518
You just need to build a list of your elements' texts, changing:
word = driver.find_elements_by_xpath(column + "[" + str(j) + "]")
to:
word = [i.text for i in driver.find_elements_by_xpath(column + "[" + str(j) + "]")]
Because .find_elements_by_xpath always returns a list, accessing .text on the result directly won't work.
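To make the list-versus-element point concrete without needing a browser, here is a minimal sketch. FakeElement and find_elements are hypothetical stand-ins I made up for illustration (they are not Selenium classes), and the two sample words are placeholders rather than scraped data. The only assumption carried over from Selenium is that the plural find call returns a list and that each element exposes .text.

```python
class FakeElement:
    """Hypothetical stand-in for a Selenium WebElement; only .text is modelled."""

    def __init__(self, text):
        self.text = text


def find_elements(xpath):
    """Hypothetical stand-in for driver.find_elements_by_xpath: always a list."""
    return [FakeElement("abdomen"), FakeElement("abscess")]


elements = find_elements("//ul/li/a")

# Asking the list itself for .text fails: the attribute lives on each element.
try:
    elements.text
except AttributeError as error:
    message = str(error)

# Reading .text from each element in turn works.
words = [element.text for element in elements]

# extend() keeps scraped_words flat; append(words) would nest a list inside it.
scraped_words = []
scraped_words.extend(words)

print(message)
print(scraped_words)
```

Note that with the one-line change above, scraped_words.append(word) stores a small list per lookup. If you want one word per line in the output file, using extend as in the sketch (or appending each text individually) is probably closer to what you are after.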
Upvotes: 3
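On the side question of why the traceback points at line 20 when only line 22 changed: as far as I can tell from the posted code, the bare except is hiding the real error. Calling .text on a list raises AttributeError at line 22, the except block swallows it and closes the driver, and because page_num was never incremented the while loop runs again. The next find_elements call (line 20) then hits a closed session, and the driver.close() in the except block (line 27) fails for the same reason. The sketch below replays that sequence with hypothetical stand-ins (FakeDriver and NoSuchSession are invented for illustration and model only an open or closed session).

```python
class NoSuchSession(Exception):
    """Hypothetical stand-in for Selenium's WebDriverException."""


class FakeDriver:
    """Hypothetical stand-in for webdriver.Chrome; models only the session state."""

    def __init__(self):
        self.session_open = True

    def find_elements(self, xpath):
        if not self.session_open:
            raise NoSuchSession("no such session")
        return ["<element 1>", "<element 2>"]

    def close(self):
        if not self.session_open:
            raise NoSuchSession("no such session")
        self.session_open = False


def replay():
    """Replay the control flow of the question's while/try/except loop."""
    driver = FakeDriver()
    events = []
    page_num = 1
    while page_num < 2:
        try:
            found = driver.find_elements("column")  # plays the role of line 20
            events.append("line 20 ok")
            found.text  # plays the role of line 22: a list has no .text
            page_num += 1  # never reached, so the loop repeats
        except Exception as error:  # behaves like the bare except here
            events.append("caught " + type(error).__name__)
            try:
                driver.close()  # plays the role of line 27
            except NoSuchSession:
                # The real script dies here with the chained traceback.
                events.append("close raised NoSuchSession")
                break
    return events


print(replay())
```

One way to avoid this kind of confusion is to catch only the exceptions you expect (or at least print the caught exception) and to break out of the loop after closing the driver, so the first failure is the one you see.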