Karl Nordgren

Reputation: 31

Scraping a dynamically generated (JavaScript) website with Python/Selenium

I'm trying to scrape this website:

http://stats.uis.unesco.org/unesco/TableViewer/tableView.aspx?ReportId=210

using Python and Selenium (see code below). The content is dynamically generated, and apparently data which is not visible in the browser is not loaded. I have tried making the browser window larger, and scrolling to the bottom of the page. Enlarging the window gets me all the data I want in the horizontal direction, but there is still plenty of data to scrape in the vertical direction. The scrolling appears not to work at all.

Does anyone have any bright ideas about how to do this?

Thanks!

import csv
import re
import time

from bs4 import BeautifulSoup  # assuming BeautifulSoup 4 (the bs4 package)
from selenium import webdriver

url = "http://stats.uis.unesco.org/unesco/TableViewer/tableView.aspx?ReportId=210"
driver = webdriver.Firefox()
driver.get(url)
driver.set_window_position(0, 0)
driver.set_window_size(100000, 200000)
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

time.sleep(5) # wait to load

soup = BeautifulSoup(driver.page_source)

table = soup.find("table", {"id":"DataTable"})

### get data
tbody = table.find('tbody')
loopRows = tbody.findAll('tr')
rows = []
for row in loopRows:
    # collect the text of every header and data cell in this row
    rows.append([val.text.encode('ascii', 'ignore') for val in row.findAll(re.compile('td|th'))])
with open("body.csv", 'wb') as test_file:
    file_writer = csv.writer(test_file)
    for row in rows:
        file_writer.writerow(row)

Upvotes: 3

Views: 6706

Answers (2)

sam2426679

Reputation: 3827

You can scroll by clicking the image in the table's vertical scroll cell (td#vScrollTD):

driver.find_element_by_css_selector("html body.TVTableBody table#pageTable tbody tr td#cell4 table#MainTable tbody tr td#vScrollTD img[onmousedown='imgClick(this.sbar.visible,this,event);']").click()

Once you can scroll, the scraping should be pretty standard, unless I'm missing something.
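A rough sketch of how the click could be combined with the scraping: read the visible rows, scroll one step, and stop when a step reveals nothing new. The names collect_all_rows, read_visible_rows and scroll_down are hypothetical, not part of Selenium. In practice read_visible_rows would parse driver.page_source the way the question does, and scroll_down would perform the click. It also assumes that no two rows of the table are identical, since repeated rows are how overlap between scroll positions is detected.

```python
def collect_all_rows(read_visible_rows, scroll_down, max_steps=1000):
    """Accumulate rows across scroll steps until a step adds nothing new.

    read_visible_rows: hypothetical callable returning the rows currently
        rendered, each row a list of cell strings.
    scroll_down: hypothetical callable that scrolls the table by one step.
    """
    seen = set()
    ordered = []
    for _ in range(max_steps):
        added = False
        for row in read_visible_rows():
            key = tuple(row)
            if key not in seen:  # skip rows already seen at an earlier position
                seen.add(key)
                ordered.append(list(row))
                added = True
        if not added:
            break  # this position showed only known rows: assume the end
        scroll_down()
    return ordered


if __name__ == "__main__":
    # Simulated table that shows two rows at a time (a stand-in for the page).
    data = [["A", "1"], ["B", "2"], ["C", "3"], ["D", "4"], ["E", "5"]]
    state = {"top": 0}

    def read_visible_rows():
        return data[state["top"]:state["top"] + 2]

    def scroll_down():
        state["top"] = min(state["top"] + 1, len(data) - 2)

    print(collect_all_rows(read_visible_rows, scroll_down))
```

The max_steps cap is only a guard against looping forever if the page keeps producing new rows.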

Upvotes: 0

unutbu

Reputation: 879341

This will get you as far as autosaving the entire csv to disk, but I haven't found a robust way to determine when the download is complete:

import os
import contextlib
import selenium.webdriver as webdriver
import csv
import time

url = "http://stats.uis.unesco.org/unesco/TableViewer/tableView.aspx?ReportId=210"
download_dir = '/tmp'
fp = webdriver.FirefoxProfile()
fp.set_preference("browser.download.dir", download_dir)
# 2 means "use the last folder specified for a download"
fp.set_preference("browser.download.folderList", 2)
fp.set_preference("browser.download.manager.showWhenStarting", False)
fp.set_preference("browser.helperApps.neverAsk.saveToDisk", "application/x-csv")

# driver = webdriver.Firefox(firefox_profile=fp)
with contextlib.closing(webdriver.Firefox(firefox_profile=fp)) as driver:
    driver.get(url)
    driver.execute_script("onDownload(2);")
    csvfile = os.path.join(download_dir, 'download.csv')

    # Wait for the download to complete
    time.sleep(10)
    with open(csvfile, 'rb') as f:
        for line in csv.reader(f, delimiter=','):
            print(line)
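One possible replacement for the fixed time.sleep(10) is to poll the download directory. This is only a sketch, not a robust solution: wait_for_download is a hypothetical helper, and it assumes Firefox keeps a temporary *.part file next to the target while the download is in progress. Because it looks at every *.part file in the directory, it works best with a dedicated download directory rather than /tmp.

```python
import glob
import os
import time


def wait_for_download(path, timeout=60.0, poll=0.5):
    """Return True once `path` looks complete, False if `timeout` expires.

    "Complete" here means: the file exists and is non-empty, no *.part file
    remains in its directory (assumed Firefox behaviour), and its size was
    the same on two consecutive polls.
    """
    directory = os.path.dirname(path) or "."
    deadline = time.time() + timeout
    last_size = -1
    while time.time() < deadline:
        in_progress = glob.glob(os.path.join(directory, "*.part"))
        if os.path.exists(path) and not in_progress:
            try:
                size = os.path.getsize(path)
            except OSError:
                size = -1  # file vanished between the two calls; try again
            if size > 0 and size == last_size:
                return True
            last_size = size
        time.sleep(poll)
    return False
```

In the script above, the call would be something like wait_for_download(csvfile) in place of time.sleep(10), opening the file only if it returns True.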

Explanation:

Point your browser to url. You'll see there is an Actions menu with an option to Download report data... and a suboption entitled "Comma-delimited ASCII format (*.csv)". If you inspect the HTML for these words you'll find

"Comma-delimited ASCII format (*.csv)","","javascript:onDownload(2);"

So it follows naturally that you might try getting webdriver to execute the JavaScript function call onDownload(2). We can do that with

driver.execute_script("onDownload(2);")

but normally another window will then pop up asking if you want to save the file. To automate the saving-to-disk, I used the method described in this FAQ. The tricky part is finding the correct MIME type to specify on this line:

fp.set_preference("browser.helperApps.neverAsk.saveToDisk", "application/x-csv")

The curl method described in the FAQ does not work here since we do not have a url for the csv file. However, this page describes another way to find the MIME type: Use a Firefox browser to open the save dialog. Check the checkbox saying "Do this automatically for files like this". Then inspect the last few lines of ~/.mozilla/firefox/*/mimeTypes.rdf for the most recently added description:

  <RDF:Description RDF:about="urn:mimetype:handler:application/x-csv"
                   NC:alwaysAsk="false"
                   NC:saveToDisk="true">
    <NC:externalApplication RDF:resource="urn:mimetype:externalApplication:application/x-csv"/>
  </RDF:Description>

This tells us the MIME type is "application/x-csv". Bingo, we are in business.
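If you would rather not read the file by eye, something like the following could list the MIME types that have a handler entry. It is a minimal sketch that assumes the entries look like the snippet above; handler_mime_types and HANDLER_RE are hypothetical names, and a regular expression is used instead of a full RDF parser.

```python
import glob
import os
import re

# Matches the RDF:about attribute of handler entries like the one shown above.
HANDLER_RE = re.compile(r'RDF:about="urn:mimetype:handler:([^"]+)"')


def handler_mime_types(rdf_text):
    """Return the MIME types that have a handler entry, in file order."""
    return HANDLER_RE.findall(rdf_text)


if __name__ == "__main__":
    pattern = os.path.expanduser("~/.mozilla/firefox/*/mimeTypes.rdf")
    for rdf_path in glob.glob(pattern):
        with open(rdf_path) as f:
            print(rdf_path, handler_mime_types(f.read()))
```

The most recently added type is the one to look at, which matches inspecting the last few lines of the file by hand.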

Upvotes: 5
