user3157868

Reputation: 61

Scraping dynamic web pages using Python 3.4 and beautifulsoup

OK, I'm using Python 3.4 and beautifulsoup4 on a Windows 7 VM, and I'm having trouble scraping the data that results from making a selection in a drop-down list. As a learning experience, I'm trying to write a scraper that can select the 4 year option on this page: www.nasdaq.com/symbol/ddd/historical and print the rows of the resulting table. So far, it just prints out the default 3 month table, along with some junk at the beginning that I don't want. Eventually I would like to scrape this data and write it to a DB using the MySQL Python connector, but for now I would just like to figure out how to make the 4 year selection in the drop-down list. (Also, I would like to get rid of the text encoding that causes the output to be in the b'blahblah' format.) My code so far:

from bs4 import BeautifulSoup
import requests

url = 'http://www.nasdaq.com/symbol/ddd/historical'
with requests.Session() as session:
    session.headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.115 Safari/537.36'}
    response = session.get(url)
    soup = BeautifulSoup(response.content)
    data = {
        'ddlTimeFrame': '4y'
    }
    response = session.post(url, data=data)
    soup = BeautifulSoup(response.content)
    for mytable in soup.find_all('tbody'):
        for trs in mytable.find_all('tr'):
            tds = trs.find_all('td')
            row = [elem.text.strip().encode('utf-8') for elem in tds]
            print (row)

I get no errors, but it doesn't print out the 4 year data. Thanks for your time/patience/help!

Upvotes: 1

Views: 2003

Answers (1)

GHajba

Reputation: 3691

When I ran your script I did get a response, but it was the default page with the data for the last 3 months.

To get the data for the last 4 years you need to change your request a bit. If you look at the XHR request in your browser's developer tools, you can see that the data sent to the server is the plain string 4y|false|DDD rather than the form field 'ddlTimeFrame': '4y'.

The second change is the content-type header which you have to send along with your POST request:

session.headers['content-type'] = 'application/json'
data = "4y|false|DDD"

With these two little changes you get your desired data.
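Putting the two changes together, a minimal sketch of the whole script could look like the one below. It assumes the page still accepts the 4y|false|DDD body described above, which may no longer hold if the site has changed. The helper names extract_rows and fetch_four_year_rows are made up for this example, and the shortened User-Agent string and the explicit 'html.parser' argument are my own choices rather than anything from the original script. Dropping the .encode('utf-8') call is what removes the b'...' prefix: in Python 3, encode() turns each str into bytes, and printing bytes shows that prefix.

```python
from bs4 import BeautifulSoup
import requests

URL = 'http://www.nasdaq.com/symbol/ddd/historical'


def extract_rows(html):
    """Return every non-empty <tbody> row as a list of stripped cell strings."""
    # Naming the parser explicitly avoids bs4's "no parser specified" warning.
    soup = BeautifulSoup(html, 'html.parser')
    rows = []
    for tbody in soup.find_all('tbody'):
        for tr in tbody.find_all('tr'):
            # Keep the cells as str (no .encode), so they print without b'...'.
            cells = [td.text.strip() for td in tr.find_all('td')]
            if cells:
                rows.append(cells)
    return rows


def fetch_four_year_rows(url=URL):
    """POST the raw time-frame string and parse the table in the response."""
    with requests.Session() as session:
        session.headers['User-Agent'] = 'Mozilla/5.0'  # shortened, assumed to be enough
        session.headers['content-type'] = 'application/json'
        # The body is the plain string seen in the XHR request, not a form dict.
        response = session.post(url, data='4y|false|DDD')
        response.raise_for_status()
        return extract_rows(response.text)


if __name__ == '__main__':
    for row in fetch_four_year_rows():
        print(row)
```

Keeping the parsing in its own function means it can be tried on a saved copy of the page without hitting the server each time.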

Upvotes: 3
