Reputation: 31
I'm trying to scrape this website:
http://stats.uis.unesco.org/unesco/TableViewer/tableView.aspx?ReportId=210
using Python and Selenium (see code below). The content is dynamically generated, and apparently data which is not visible in the browser is not loaded. I have tried making the browser window larger, and scrolling to the bottom of the page. Enlarging the window gets me all the data I want in the horizontal direction, but there is still plenty of data to scrape in the vertical direction. The scrolling appears not to work at all.
Does anyone have any bright ideas about how to do this?
Thanks!
import csv
import re
import time

from bs4 import BeautifulSoup  # import was missing from the post; BeautifulSoup 4 assumed
from selenium import webdriver

url = "http://stats.uis.unesco.org/unesco/TableViewer/tableView.aspx?ReportId=210"
driver = webdriver.Firefox()
driver.get(url)
driver.set_window_position(0, 0)
driver.set_window_size(100000, 200000)
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
time.sleep(5)  # wait to load

soup = BeautifulSoup(driver.page_source)
table = soup.find("table", {"id": "DataTable"})

### get data
tbody = table.find('tbody')
loopRows = tbody.findAll('tr')
rows = []
for row in loopRows:
    # collect the text of every header and data cell in the row
    rows.append([val.text.encode('ascii', 'ignore') for val in row.findAll(re.compile('td|th'))])

with open("body.csv", 'wb') as test_file:
    file_writer = csv.writer(test_file)
    for row in rows:
        file_writer.writerow(row)
Upvotes: 3
Views: 6706
Reputation: 3827
You can scroll the table by clicking the image inside its vertical scroll cell (td#vScrollTD):
driver.find_element_by_css_selector("html body.TVTableBody table#pageTable tbody tr td#cell4 table#MainTable tbody tr td#vScrollTD img[onmousedown='imgClick(this.sbar.visible,this,event);']").click()
Once you can scroll, the scraping should be fairly standard, unless I'm missing something.
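A rough sketch of how the scroll-then-scrape loop might look, kept independent of Selenium so it can be tried on its own. The names collect_all_rows, read_rows and scroll_once are hypothetical: read_rows would wrap the page_source parsing from the question, and scroll_once would wrap the click shown above. It assumes that rows can be told apart by their content, so two rows with identical cells would be merged.

```python
def collect_all_rows(read_rows, scroll_once, max_scrolls=1000):
    """Accumulate table rows across repeated scroll steps.

    read_rows:   callable returning the currently visible rows (list of lists)
    scroll_once: callable that advances the table by one scroll step
    Stops as soon as a pass over the visible rows finds nothing new,
    or after max_scrolls passes as a safety limit.
    """
    seen = set()
    ordered = []
    for _ in range(max_scrolls):
        found_new = False
        for row in read_rows():
            key = tuple(row)
            if key not in seen:
                seen.add(key)
                ordered.append(list(row))
                found_new = True
        if not found_new:
            break  # scrolling no longer reveals unseen rows
        scroll_once()
    return ordered


# Hypothetical wiring with a live driver (not executed here):
#   rows = collect_all_rows(
#       read_rows=lambda: parse_visible_rows(driver.page_source),
#       scroll_once=lambda: scroll_image.click(),
#   )
# where parse_visible_rows and scroll_image are placeholders you would define.
```

Deduplicating on row content is the simplest stopping rule; if the table can contain repeated rows, a key based on the row header would probably be safer.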
Upvotes: 0
Reputation: 879341
This will get you as far as autosaving the entire csv to disk, but I haven't found a robust way to determine when the download is complete:
import os
import contextlib
import selenium.webdriver as webdriver
import csv
import time

url = "http://stats.uis.unesco.org/unesco/TableViewer/tableView.aspx?ReportId=210"
download_dir = '/tmp'

fp = webdriver.FirefoxProfile()
fp.set_preference("browser.download.dir", download_dir)
# 2 means "use the last folder specified for a download"
fp.set_preference("browser.download.folderList", 2)
fp.set_preference("browser.download.manager.showWhenStarting", False)
fp.set_preference("browser.helperApps.neverAsk.saveToDisk", "application/x-csv")

# driver = webdriver.Firefox(firefox_profile=fp)
with contextlib.closing(webdriver.Firefox(firefox_profile=fp)) as driver:
    driver.get(url)
    driver.execute_script("onDownload(2);")
    csvfile = os.path.join(download_dir, 'download.csv')

    # Wait for the download to complete
    time.sleep(10)
    with open(csvfile, 'rb') as f:
        for line in csv.reader(f, delimiter=','):
            print(line)
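The fixed time.sleep(10) in the script is only a guess at how long the download takes. One possible alternative, sketched here as a heuristic rather than a robust solution, is to poll the target file. The helper name wait_for_download is hypothetical, and the check for a ".part" file next to the target is an assumption about how Firefox names in-progress downloads.

```python
import os
import time


def wait_for_download(path, timeout=60.0, poll=0.5):
    """Return True once `path` looks like a finished download, False on timeout.

    Heuristic: the file exists, is non-empty, has no `path + ".part"`
    companion (assumed Firefox temp-file naming), and its size is the
    same on two consecutive polls.
    """
    deadline = time.time() + timeout
    last_size = -1
    while time.time() < deadline:
        if os.path.exists(path) and not os.path.exists(path + ".part"):
            size = os.path.getsize(path)
            if size > 0 and size == last_size:
                return True
            last_size = size
        time.sleep(poll)
    return False
```

In the script, this could stand in for the sleep, for example `if wait_for_download(csvfile): ...`, called while the browser is still open. A stable size does not prove the download finished, so a stalled connection could still fool it.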
Explanation:
Point your browser to url.
You'll see there is an Actions menu with an option to Download report data... and a suboption entitled "Comma-delimited ASCII format (*.csv)". If you inspect the HTML for these words you'll find
"Comma-delimited ASCII format (*.csv)","","javascript:onDownload(2);"
So it follows naturally that you might try getting webdriver to execute the JavaScript function call onDownload(2). We can do that with
driver.execute_script("onDownload(2);")
but normally another window will then pop up asking if you want to save the file. To automate the saving-to-disk, I used the method described in this FAQ. The tricky part is finding the correct MIME type to specify on this line:
fp.set_preference("browser.helperApps.neverAsk.saveToDisk", "application/x-csv")
The curl method described in the FAQ does not work here since we do not have a url for the csv file. However, this page describes another way to find the MIME type: Use a Firefox browser to open the save dialog. Check the checkbox saying "Do this automatically for files like this". Then inspect the last few lines of ~/.mozilla/firefox/*/mimeTypes.rdf for the most recently added description:
<RDF:Description RDF:about="urn:mimetype:handler:application/x-csv"
NC:alwaysAsk="false"
NC:saveToDisk="true">
<NC:externalApplication RDF:resource="urn:mimetype:externalApplication:application/x-csv"/>
</RDF:Description>
This tells us the MIME type is "application/x-csv". Bingo, we are in business.
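If you would rather not read mimeTypes.rdf by eye, the lookup can be scripted. This is only a sketch: the function name handler_mime_types is hypothetical, and it assumes handler entries always carry the RDF:about="urn:mimetype:handler:..." attribute seen in the snippet above and that the profile actually has a mimeTypes.rdf file.

```python
import re

# Matches the attribute shown in the RDF snippet and captures the MIME type.
HANDLER_RE = re.compile(r'RDF:about="urn:mimetype:handler:([^"]+)"')


def handler_mime_types(rdf_text):
    """Return the MIME types that have a handler entry, in file order."""
    return HANDLER_RE.findall(rdf_text)


# Hypothetical usage against a real profile (not executed here):
#   import glob, os
#   for path in glob.glob(os.path.expanduser("~/.mozilla/firefox/*/mimeTypes.rdf")):
#       with open(path) as fh:
#           print(path, handler_mime_types(fh.read()))
```

Since the answer relies on the newest description being near the end of the file, the last element of the returned list is the one to look at first.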
Upvotes: 5