Vasily A

Reputation: 8646

reading HTTPResponse object "empties" it

I read a webpage using urllib.request.urlopen:

import urllib.request
import shutil

my_response = urllib.request.urlopen('https://google.com') # object of HTTPResponse type

Then I want both to save it to a file and to keep the content in a variable for further processing. If I try, for example, the following:

shutil.copyfileobj(my_response, open('gpage.html', 'wb')) # saved successfully
my_content = my_response.read() # empty

the file is saved successfully, but my_response is empty afterwards.
Vice versa, if I call .read() first, I can get the content, but the saved file will be empty:

my_content = my_response.read() # works as expected
shutil.copyfileobj(my_response, open('gpage.html', 'wb')) # empty file

i.e. I can only access the content once. I remember this behavior is typical for some other types of Python objects (all iterators?), but I'm not sure of the correct term for it. What would be the recommended solution in my case, if I want both to write the content to a file and keep it in a variable? (So far I use a workaround: writing to the file and then reading it back.)

Upvotes: 1

Views: 392

Answers (1)

Jan Steinke

Reputation: 513

This is normal behaviour for a stream-like reader: an HTTPResponse can only be consumed once, and it is not seekable, so there is no way to rewind it. You can easily circumvent this by first reading everything into a variable and doing your operations on that variable:

my_content = my_response.read() # read from buffer and store in variable
with open('gpage.html', 'wb') as fp: 
    fp.write(my_content) # use the variable instead of the reader again
# do more stuff with my_content

The stream is exhausted as you consume the data in it. In this case shutil.copyfileobj also calls .read() on the object, so whichever call comes first gets the content and the second call gets nothing.
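The one-shot behaviour can be reproduced with any file-like object from the standard library; here io.BytesIO stands in for the HTTP response (an assumption for the demo, since it avoids a network call):

```python
import io

# io.BytesIO stands in for the HTTP response; like HTTPResponse,
# read() with no argument consumes everything up to EOF.
fake_response = io.BytesIO(b"<html>hello</html>")

first = fake_response.read()   # consumes everything: b'<html>hello</html>'
second = fake_response.read()  # position is at EOF, so: b''
```

One difference: unlike a real HTTPResponse, a BytesIO object is seekable, so here you could rewind with seek(0); a network response offers no such rewind.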

Also: the documentation of urllib.request recommends using the response as a context manager, just like any other resource:

with urllib.request.urlopen('https://google.com') as response:
    my_content = response.read()

This way the resource is freed again directly after everything has been read, as soon as the with ...: block ends, and you are not holding the connection open any longer than needed.

Together that would make:

import urllib.request

with urllib.request.urlopen('https://google.com') as response:
    my_content = response.read()
with open('gpage.html', 'wb') as fp:
    fp.write(my_content)
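If you prefer to keep using shutil.copyfileobj (for instance to reuse existing code for large payloads), you can wrap the already-read bytes in a seekable io.BytesIO buffer; in this sketch my_content is a placeholder for the bytes read from the response:

```python
import io
import shutil

# Placeholder for the bytes already read from the response.
my_content = b"<html>example</html>"

# BytesIO gives copyfileobj a fresh file-like object to read from,
# so the my_content variable itself stays available for later use.
with open('gpage.html', 'wb') as fp:
    shutil.copyfileobj(io.BytesIO(my_content), fp)
```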

Upvotes: 2
