Reputation: 13811
def crawl(url):
    html = getHTML(url)  # getHTML() returns an HTTPResponse
    print(html.read())  # PRINT STATEMENT 1
    if html is None:
        print("Error getting HTML")
    else:
        # parse html
        bsObj = BeautifulSoup(html, "lxml")
        # print data
        try:
            print(bsObj.h1.get_text())
        except AttributeError as e:
            print(e)
        print(html.read())  # PRINT STATEMENT 2
What I don't understand is this:
PRINT STATEMENT 1 prints the whole HTML, whereas PRINT STATEMENT 2 prints only b''.
What is happening here? I'm quite new to Python.
Upvotes: 0
Views: 53
Reputation: 27744
html is an HTTPResponse object. HTTPResponse supports file-like operations, such as read().
Just like when reading a file, a read() consumes the available data and moves the file pointer to the end of the file/data. A subsequent read() has nothing to return.
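As a minimal sketch of this behaviour, the snippet below uses io.BytesIO as a stand-in for the HTTPResponse (an assumption for illustration; no network request is made). Both objects expose a file-like read(), and in both cases the second call finds nothing left to return.

```python
import io

# io.BytesIO stands in for the HTTPResponse here (assumption for illustration);
# both expose a file-like read().
stream = io.BytesIO(b"<html><h1>Title</h1></html>")

first = stream.read()   # consumes everything up to the end of the data
second = stream.read()  # the pointer is already at the end, nothing is left

print(first)   # b'<html><h1>Title</h1></html>'
print(second)  # b''
```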
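You have two options:
Reset the file pointer to the beginning after reading using the seek() method. Note that this only works on seekable streams such as regular files or io.BytesIO. If html is an http.client.HTTPResponse (what urllib.request.urlopen() returns), it is not seekable and seek() raises io.UnsupportedOperation, so copy the data into a buffer first:
    import io
    buf = io.BytesIO(html.read())  # copy the body into a seekable in-memory buffer
    print(buf.read())
    buf.seek(0)  # moves the file pointer to byte 0 relative to the start of the file/data
Save the result instead:
    html_body = html.read()
    print(html_body)
Typically, you would use the second option, as it is easier to reuse html_body.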
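Here is a rough sketch of the second option applied to a crawl-like function: read the body exactly once, then reuse the saved bytes for everything else. The names first_h1 and H1Extractor are hypothetical, io.BytesIO again stands in for the HTTPResponse, and the standard-library html.parser is used in place of BeautifulSoup only to keep the example self-contained. With BeautifulSoup you would pass html_body instead of html.

```python
import io
from html.parser import HTMLParser


class H1Extractor(HTMLParser):
    """Collects the text of the first <h1> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.in_h1 = False
        self.done = False
        self.text = ""

    def handle_starttag(self, tag, attrs):
        if tag == "h1" and not self.done:
            self.in_h1 = True

    def handle_endtag(self, tag):
        if tag == "h1" and self.in_h1:
            self.in_h1 = False
            self.done = True

    def handle_data(self, data):
        if self.in_h1:
            self.text += data


def first_h1(response):
    """Read the response body once and reuse the bytes for every later step."""
    html_body = response.read()  # the only read() call on the response
    parser = H1Extractor()
    parser.feed(html_body.decode("utf-8", errors="replace"))
    return html_body, parser.text


# io.BytesIO is a stand-in for the HTTPResponse returned by getHTML().
fake_response = io.BytesIO(b"<html><body><h1>Hello</h1></body></html>")
body, heading = first_h1(fake_response)

print(heading)  # Hello
print(body)     # the full body is still available because it was saved
```

Because the bytes live in html_body, they can be printed, parsed, or logged any number of times without touching the response again.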
Upvotes: 1