Zanam

Reputation: 4807

Python 3.5 beautifulsoup unable to read page

When I go through the following process:

The above steps take me to the following URL: http://propaccess.traviscad.org/clientdb/Property.aspx?prop_id=228792

where you can see the data.

However, if I use the following code:

from urllib.request import urlopen
from bs4 import BeautifulSoup
url = 'http://propaccess.traviscad.org/clientdb/Property.aspx?prop_id=312669'
soup = BeautifulSoup(urlopen(url).read(), 'html.parser')
print(soup)

Instead of the data, I get the following page:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
        "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<meta http-equiv="Content-type" content="text/html;charset=utf-8" />
<title>Travis Property Search</title>
<style type="text/css">
      body { text-align: center; padding: 150px; }
      h1 { font-size: 50px; }
      body { font: 20px Helvetica, sans-serif; color: #333; }
      #article { display: block; text-align: left; width: 650px; margin: 0 auto; }
      a { color: #dc8100; text-decoration: none; }
      a:hover { color: #333; text-decoration: none; }
    </style>
</head>
<body>
<div id="article">
<h1>Please try again</h1>
<div>
<p>Sorry for the inconvenience but your session has either timed out or the server is busy handling other requests. You may visit us on the the following website for information, otherwise please retry your search again shortly:<br /><br />
<a href="http://www.traviscad.org/">Travis Central Appraisal District Website</a> </p>
<p><b><a href="http://propaccess.traviscad.org/clientdb/?cid=1">Click here to reload the property search to try again</a></b></p>
</div>
</div>
</body>
</html>

I have tried other approaches, such as handling cookies, but I am still unable to read the data using Python.

Upvotes: 0

Views: 142

Answers (1)

Mark

Reputation: 92471

Try something like this:

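import requests
from bs4 import BeautifulSoup

# A Session keeps cookies between requests.
s = requests.Session()
# Visit the search page first so the server sets the session cookie.
r = s.get('http://propaccess.traviscad.org/clientdb/?cid=1')
# The same session sends that cookie along with this request.
r2 = s.get('http://propaccess.traviscad.org/clientdb/Property.aspx?prop_id=312669')

soup = BeautifulSoup(r2.text, 'html.parser')
print(soup.prettify())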

The first request loads the page that establishes the session, and the requests session object stores the resulting cookie. The second request sends that cookie automatically, so the server returns the property page instead of the "Please try again" page. You can then hand that text to BeautifulSoup for parsing.
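
To illustrate what the session is doing, here is a rough sketch of the same two-step flow using only the standard library. It assumes, as above, that the ?cid=1 page is what sets the session cookie; I have not checked how the site actually manages sessions. The names BASE, is_retry_page and fetch_property are made up for this example.

```python
from http.cookiejar import CookieJar
from urllib.request import HTTPCookieProcessor, build_opener

# Base URL taken from the question; everything below it is illustrative.
BASE = 'http://propaccess.traviscad.org/clientdb/'


def is_retry_page(html):
    """Return True if html looks like the 'Please try again' page from the question."""
    return '<h1>Please try again</h1>' in html


def fetch_property(prop_id, timeout=30):
    """Fetch a property page, visiting the search page first to obtain cookies."""
    jar = CookieJar()
    # The opener stores cookies from responses and replays them on later requests.
    opener = build_opener(HTTPCookieProcessor(jar))

    # Step 1: load the search page (assumed to set the session cookie).
    opener.open(BASE + '?cid=1', timeout=timeout).close()

    # Step 2: request the property page with the same opener and cookie jar.
    url = '{}Property.aspx?prop_id={}'.format(BASE, prop_id)
    with opener.open(url, timeout=timeout) as resp:
        html = resp.read().decode('utf-8', errors='replace')

    # Fail loudly rather than parsing the error page as if it were data.
    if is_retry_page(html):
        raise RuntimeError('server returned the "Please try again" page')
    return html
```

Calling fetch_property(312669) should then return the HTML of the property page, or raise if the server still answers with the retry page. The check in is_retry_page is deliberately simple and only matches the heading shown in the question.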

Upvotes: 1
