Reputation: 3166
I am scraping a webpage using Selenium in Python. I am able to locate the elements using this code:
from selenium import webdriver
import codecs
driver = webdriver.Chrome()
driver.get("url")
results_table=driver.find_elements_by_xpath('//*[@id="content"]/table[1]/tbody/tr')
Each element in results_table
is in turn a set of sub-elements, with the number of sub-elements varying from element to element. My goal is to output each element, as a list or as a delimited string, into an output file. My code so far is this:
results_file=codecs.open(path+"results.txt","w","cp1252")
for element in enumerate(results_table):
element_fields=element.find_elements_by_xpath(".//*[text()][count(*)=0]")
element_list=[field.text for field in element_fields]
stuff_to_write='#'.join(element_list)+"\r\n"
results_file.write(stuff_to_write)
#print (i)
results_file.close()
driver.quit()
This second part of code takes about 2.5 minutes on a list of ~400 elements, each with about 10 sub-elements. I get the desired output, but it is too slow. What could I do to improve the prformance ?
Using python 3.6
Upvotes: 2
Views: 1145
Reputation: 5692
Download the whole page in one shot, then use something like BeautifulSoup to process it. I haven't used splinter or selenium in a while, but in Splinter, .html will give you the page. I'm not sure what the syntax is for that in Selenium, but there should be a way to grab the whole page.
Selenium (and Splinter, which is layered on top of Selenium) are notoriously slow for randomly accessing web page content. Looks like .page_source may give the entire contents of the page in Selenium, which I found at stackoverflow.com/questions/35486374/…. If reading all the chunks on the page one at a time is killing your performance (and it probably is), reading the whole page once and processing it offline will be oodles faster.
Upvotes: 1