Reputation: 61
OK, I'm using Python 3.4 and beautifulsoup4 on a Windows 7 VM. I'm having trouble scraping the data that results from making a selection in a drop-down list. As a learning experience, I'm trying to write a scraper that can select the 4 year option on this page: www.nasdaq.com/symbol/ddd/historical and print the rows of the resulting table. So far, it just prints out the default 3 month table, along with some junk at the beginning that I don't want. Eventually I would like to scrape this data and write it to a DB using the MySQL Python connector, but for now I would just like to figure out how to make the 4 year selection in the drop-down list. (Also, I would like to get rid of the text encoding that causes the output to be in the b'blahblah' format.) My code so far:
    from bs4 import BeautifulSoup
    import requests

    url = 'http://www.nasdaq.com/symbol/ddd/historical'
    with requests.Session() as session:
        session.headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.115 Safari/537.36'}
        response = session.get(url)
        soup = BeautifulSoup(response.content)

        data = {
            'ddlTimeFrame': '4y'
        }

        response = session.post(url, data=data)

        soup = BeautifulSoup(response.content)
        for mytable in soup.find_all('tbody'):
            for trs in mytable.find_all('tr'):
                tds = trs.find_all('td')
                row = [elem.text.strip().encode('utf-8') for elem in tds]
                print (row)
I get no errors, but it doesn't print out the 4 year data. Thanks for your time/patience/help!
Upvotes: 1
Views: 2003
Reputation: 3691
When I ran your script I did get a response, but it was the default page with only the last 3 months of data.
To get the data from the last 4 years you need to change your request a bit. If you look at the XHR request in your browser's developer tools, you can see that the data sent to the server is 4y|false|DDD instead of 'ddlTimeFrame': '4y'.
The second change is the content-type header, which you have to send along with your POST request:
session.headers['content-type'] = 'application/json'
data = "4y|false|DDD"
With these two little changes you get your desired data.
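Put together, the two changes might look roughly like the sketch below. The helper names build_payload and fetch_rows are made up for illustration, the shortened User-Agent string is a placeholder, and the payload format is simply what was observed in the XHR request (the meaning of the middle "false" field is not known here), so treat this as a starting point and not a guaranteed recipe; the site may also have changed since. Leaving out .encode('utf-8') keeps the cells as ordinary str values, which should avoid the b'...' output mentioned in the question.

```python
URL = 'http://www.nasdaq.com/symbol/ddd/historical'


def build_payload(timeframe, symbol):
    # Body format as observed in the XHR request: "<timeframe>|false|<SYMBOL>".
    # The middle "false" field is copied as-is; its meaning is an unknown here.
    return '{}|false|{}'.format(timeframe, symbol.upper())


def fetch_rows(url, payload):
    # Third-party imports are kept local so build_payload() can be used
    # (and tested) without requests or beautifulsoup4 installed.
    import requests
    from bs4 import BeautifulSoup

    with requests.Session() as session:
        # Placeholder User-Agent (assumption); the question used a full browser string.
        session.headers['User-Agent'] = 'Mozilla/5.0'
        session.get(url)  # initial GET, as in the question's script
        session.headers['content-type'] = 'application/json'
        response = session.post(url, data=payload)
        response.raise_for_status()

    soup = BeautifulSoup(response.content, 'html.parser')
    rows = []
    for tbody in soup.find_all('tbody'):
        for tr in tbody.find_all('tr'):
            # Keep the cell text as str; no .encode(), so no b'...' prefix.
            cells = [td.text.strip() for td in tr.find_all('td')]
            if cells:  # skip rows that contain no <td> cells
                rows.append(cells)
    return rows
```

Calling fetch_rows(URL, build_payload('4y', 'ddd')) and printing each returned row would then replace the loop at the end of the original script.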
Upvotes: 3