Reputation: 95
I am trying to take the html of my website and see if it is the same as what I have on an offline version.
I have been researching this, and all I can find is either parsing or something that deals with only http://
So far I have this:
import urllib
url = "https://www.mywebsite.com/"
onlinepage = urllib.urlopen(url)
print(onlinepage.read())
offlinepage = open("offline.txt", "w+")
print(offlinepage.read())
if onlinepage.read() == offlinepage.read():
print("same") # for debugging
else:
print("different")
This always says that they are the same, even when I put in a different website entirely.
Upvotes: 1
Views: 87
Reputation: 155403
As others have noted, you can't read
the request object twice (and can't read
the file twice without seeking); once read, the data you got back is no longer available, so you need to store it.
But they missed another problem: You opened the file with mode w+
. w+
allows both reading and writing, but, just like mode w
, it truncates the file on open
. So your local file is always empty when you read it, which means you're both corrupting the local file and never getting a match (unless the online file is empty too).
You need to use mode r+
or a+
to get a read/write handle that doesn't truncate the existing file (r+
requires that the file already exist, a+
does not, but puts the write position at end of file, and on some systems, all writes are put at the end of the file).
So fixing both bugs, you get:
import urllib
url = "https://www.mywebsite.com/"
# Using with statements properly for safe resource cleanup
with urllib.urlopen(url) as onlinepage:
onlinedata = onlinepage.read()
print(onlinedata)
with open("offline.txt", "r+") as offlinepage: # DOES NOT TRUNCATE EXISTING FILE!
offlinedata = offlinepage.read()
print(offlinedata)
if onlinedata == offlinedata:
print("same") # for debugging
else:
print("different")
# I assume you want to rewrite the local page, or you wouldn't open with +
# so this is what you'd do to ensure you replace the existing data correctly
offlinepage.seek(0) # Ensure you're seeked to beginning of file for write
offlinepage.write(onlinedata)
offlinepage.truncate() # If online data smaller, don't keep offline extra data
Upvotes: 1
Reputation: 7840
When you first print your online and offline pages with these lines:
print(onlinepage.read())
print(offlinepage.read())
...you have now consumed all of the text in each file object. Subsequent reads on either object will return an empty string. Two empty strings are equal, therefore your if
condition will always evaluate to True
.
If you were purely working with files, you could seek
to the beginning of both files and read again. Since there is no seek
method on the file object from urlopen
, you'll need to either re-fetch the page with a new urlopen
command or, better, save the original text in a variable and use that for your subsequent comparisons:
online = onlinepage.read()
print(online)
offline = offlinepage.read()
print(offline)
...
if online == offline:
...
Upvotes: 4
Reputation: 977
You use .read()
twice on each file.
>>> f.read()
'This is the entire file.\n'
>>> f.read()
''
"If the end of the file has been reached, f.read() will return an empty string (""
)." (7.2.1 Docs).
Therefore, when two results are compared, they are equal because each is an empty string.
Upvotes: 0