Reputation: 53
When I use the urllib module, I can call/print/search the HTML of a website the first time, but when I try again it is gone. How can I keep the HTML throughout the program?
For example, when I try:
html = urllib.request.urlopen('http://www.bing.com/search?q=Mike&go=&qs=n&form=QBLH&filt=all&pq=mike&sc=8-2&sp=-1&sk=')
search = re.findall(r'Mike',str(html.read()))
search
I get:
['Mike','Mike','Mike','Mike']
But then when I try to do this a second time like so:
results = re.findall(r'Mike',str(html.read()))
I get:
[]
when evaluating 'results'.
Why does this happen, and how can I fix it?
Upvotes: 0
Views: 107
Reputation: 178409
In addition to the correct guess of @rvalik that you can only read a stream once, data = str(html.read()) is incorrect. html.read() returns a bytes object, and str returns the display representation of that object. An example:
>>> data = b'Mike'
>>> str(data)
"b'Mike'"
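To see why this matters beyond a plain ASCII match, here is a small sketch with a non-ASCII byte string. The sample value raw is made up for illustration; the point is only that str() leaves escape sequences in the text, while decode() recovers the actual characters.

```python
import re

# Hypothetical sample: the UTF-8 bytes for "café Mike"
raw = "café Mike".encode("utf-8")

as_repr = str(raw)             # display representation: b'...' wrapper plus \x escapes
as_text = raw.decode("utf-8")  # the real text

repr_hits = re.findall(r"café", as_repr)  # no match: the é appears as \xc3\xa9
text_hits = re.findall(r"café", as_text)  # one match
```

A search for a pure ASCII word such as Mike happens to work on both strings, which is why the problem is easy to miss.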
What you should do is either decode the bytes object using the encoding of the HTML page (UTF-8 in this case):
from urllib.request import urlopen
import re
with urlopen('http://www.bing.com/search?q=Mike&go=&qs=n&form=QBLH&filt=all&pq=mike&sc=8-2&sp=-1&sk=') as html:
    data = html.read().decode('utf8')
    print(re.findall(r'Mike',data))
or search with a bytes object:
from urllib.request import urlopen
import re
with urlopen('http://www.bing.com/search?q=Mike&go=&qs=n&form=QBLH&filt=all&pq=mike&sc=8-2&sp=-1&sk=') as html:
    data = html.read()
    print(re.findall(rb'Mike',data))
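If the page's encoding is not known in advance, one option is to take it from the Content-Type header instead of hard-coding it. The following is a minimal offline sketch, assuming a helper named charset_from_content_type (hypothetical, not part of urllib) and a made-up header value. With a real response, html.headers.get_content_charset() should give the same information.

```python
from email.message import Message

def charset_from_content_type(content_type, default="utf-8"):
    """Return the charset named in a Content-Type header value, or `default`."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns None when the header names no charset
    return msg.get_content_charset() or default

body = "Mike".encode("utf-8")  # stand-in for html.read()
encoding = charset_from_content_type("text/html; charset=UTF-8")
text = body.decode(encoding)
```

Falling back to UTF-8 when no charset is declared is an assumption of this sketch, not a rule; some pages declare their encoding only inside the HTML itself.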
Upvotes: 1
Reputation: 1559
Without being very well versed in Python, I'm guessing html.read() reads the HTTP stream, so when you call it the second time there is nothing left to read.
Try:
html = urllib.request.urlopen('http://www.bing.com/search?q=Mike&go=&qs=n&form=QBLH&filt=all&pq=mike&sc=8-2&sp=-1&sk=')
data = str(html.read())
search = re.findall(r'Mike',data)
search
And then use
results = re.findall(r'Mike',data)
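The read-once behaviour can be reproduced without a network connection. In this sketch, io.BytesIO is only a stand-in for the object returned by urlopen (an assumption made to keep the example offline), and the sample bytes are made up.

```python
import io
import re

# io.BytesIO stands in for the response object returned by urlopen()
stream = io.BytesIO(b"Mike and Mike")

first = stream.read()   # consumes the whole stream
second = stream.read()  # nothing left to read, so this is b''

# Keep the first result in a variable and search it as often as needed
search = re.findall(rb"Mike", first)
results = re.findall(rb"Mike", first)
```

Both searches return the same matches because they work on the saved bytes rather than on the exhausted stream.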
Upvotes: 2