antone king

Reputation: 83

Python web scrape with Beautiful Soup

I am able to scrape this site's tables with no issue. However, to get the tables I have customized, I need to log in first and then scrape, because otherwise I get the default output. I feel like I am close, but I am relatively new to Python. Looking forward to learning more about mechanize and BeautifulSoup.

The login itself seems to work, since I get an "incorrect password" error if I purposely enter a wrong password below. But how do I connect the login to the URL I want to scrape?

from bs4 import BeautifulSoup
import urllib
import csv
import mechanize
import cookielib

cj = cookielib.CookieJar()
br = mechanize.Browser()
br.set_cookiejar(cj)
br.open("http://www.barchart.com/login.php")

br.select_form(nr=0)
br.form['email'] = 'username'
br.form['password'] = 'password'
br.submit()

#print br.response().read()

r = urllib.urlopen("http://www.barchart.com/stocks/sp500.php?view=49530&_dtp1=0").read()

soup = BeautifulSoup(r, "html.parser")

tables = soup.find("table", attrs={"class" : "datatable ajax"})

headers = [header.text for header in tables.find_all('th')]

rows = []

for row in tables.find_all('tr'):
    rows.append([val.text.encode('utf8') for val in row.find_all('td')])


with open('snp.csv', 'wb') as f:
    writer = csv.writer(f)
    writer.writerow(headers)
    writer.writerows(row for row in rows if row)

#from pymongo import MongoClient
#import datetime
#client = MongoClient('localhost', 27017)

print soup.table.get_text()

Upvotes: 0

Views: 716

Answers (1)

mhawke

Reputation: 87074

I am not sure that you actually need to log in to retrieve the URL in your question; I get the same results whether logged in or not.

However, if you do need to be logged in to access other data, the problem is that you are logging in with mechanize but then using urllib.urlopen() to fetch the page. There is no connection between the two, so any session data gathered by mechanize (such as the login cookies) is not available to urlopen() when it makes its request.

In this case you don't need to use urlopen() because you can open the URL and access the HTML with mechanize:

r = br.open("http://www.barchart.com/stocks/sp500.php?view=49530&_dtp1=0")
soup = BeautifulSoup(r.read(), "html.parser")
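
To see why the same browser object has to make both requests, here is a minimal sketch of the session effect. It is not the barchart.com flow: it uses only the Python 3 standard library instead of mechanize, and the local demo server, the /login and /data paths, and the session cookie value are all made up for illustration.

```python
# Python 3 sketch. The server, paths, and cookie below are hypothetical.
import http.cookiejar
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical site: /login sets a session cookie, other paths check for it."""

    def do_GET(self):
        if self.path == "/login":
            body = b"logged in"
            self.send_response(200)
            self.send_header("Set-Cookie", "session=abc123; Path=/")
        else:
            has_cookie = "session=abc123" in self.headers.get("Cookie", "")
            body = b"custom view" if has_cookie else b"default view"
            self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


def fetch_views():
    """Fetch /data once with the login session and once without it."""
    server = HTTPServer(("127.0.0.1", 0), DemoHandler)  # port 0: any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    base = "http://127.0.0.1:%d" % server.server_port
    no_proxy = urllib.request.ProxyHandler({})  # bypass any system proxy for the local server
    try:
        # One opener with one cookie jar: the login cookie is stored and re-sent.
        jar = http.cookiejar.CookieJar()
        session = urllib.request.build_opener(
            no_proxy, urllib.request.HTTPCookieProcessor(jar)
        )
        session.open(base + "/login").read()
        with_session = session.open(base + "/data").read()

        # A separate opener without that jar, like a bare urlopen() call.
        fresh = urllib.request.build_opener(no_proxy)
        without_session = fresh.open(base + "/data").read()
    finally:
        server.shutdown()
        server.server_close()
    return with_session, without_session


if __name__ == "__main__":
    print(fetch_views())
```

The opener that logged in gets the customized response, while the fresh one falls back to the default. This is the same situation as logging in through br and then fetching the page with urllib.urlopen().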

Upvotes: 2
