Reputation: 8646
I read a webpage using urllib.request.urlopen:
import urllib.request
import shutil
my_response = urllib.request.urlopen('https://google.com') # object of HTTPResponse type
Then, I want both to save it as a file and use the variable for future processing in the code. If I try, for example, the following:
shutil.copyfileobj(my_response, open('gpage.html', 'wb')) # saved successfully
my_content = my_response.read() # empty
the file is successfully saved, but my_response becomes empty after that. Vice versa, if I call .read() first, I can get the content, but the saved file will be empty:
my_content = my_response.read() # works as expected
shutil.copyfileobj(my_response, open('gpage.html', 'wb')) # empty file
i.e. I can only access the content once. I remember this behavior is typical for some other types of Python objects (all iterators?), but I'm not sure what the correct term for it is.
What would be the recommended solution in my case, if I want both to write the content to a file and keep it in a variable? (So far I use a workaround: writing to the file and then reading it back.)
Upvotes: 1
Views: 392
Reputation: 513
This is normal behaviour for a file-like stream (here, a buffered reader over the HTTP response): reading consumes the data, and the stream cannot be rewound. You can easily circumvent it by first reading everything into a variable and then doing all your operations on that variable:
my_content = my_response.read() # read from buffer and store in variable
with open('gpage.html', 'wb') as fp:
    fp.write(my_content) # use the variable instead of the reader again
# do more stuff with my_content
The buffer is emptied as you consume the data in it, to make space for more data. In this case, shutil.copyfileobj also calls .read() on the object, and thus only the first call gets what's in the buffer.
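The read-once behaviour is easy to demonstrate without a network connection; io.BytesIO stands in for the response object here, since both are forward-only, file-like streams:

```python
import io

# io.BytesIO behaves like the HTTP response for this purpose:
# reading advances the position, and a second read finds nothing left.
buf = io.BytesIO(b"<html>hello</html>")

first = buf.read()   # consumes everything in the buffer
second = buf.read()  # buffer already exhausted

print(first)   # b'<html>hello</html>'
print(second)  # b''
```

(Unlike the HTTP response, BytesIO could be rewound with buf.seek(0); a network response offers no such rewind, which is why reading into a variable first is the way to go.)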
Also: the documentation of urllib.request recommends opening the URL as a context manager, just like any other resource:

with urllib.request.urlopen('https://google.com') as response:
    my_content = response.read()

This way the resource is freed again directly after everything has been read from it, and you consume less memory as soon as the with ...: scope ends.
Together that would make:
import urllib.request

with urllib.request.urlopen('https://google.com') as response:
    my_content = response.read()
with open('gpage.html', 'wb') as fp:
    fp.write(my_content)
Upvotes: 2