Reputation: 95

How can I take the html file of a website

I am trying to take the html of my website and see if it is the same as what I have on an offline version.

I have been researching this, but all I can find is either about parsing or deals only with http:// URLs.

So far I have this:

import urllib
url = "https://www.mywebsite.com/"
onlinepage = urllib.urlopen(url)
print(onlinepage.read())
offlinepage = open("offline.txt", "w+")
print(offlinepage.read())

if onlinepage.read() == offlinepage.read():
    print("same") # for debugging
else:
    print("different")

This always says that they are the same, even when I put in a different website entirely.

Upvotes: 1

Answers (3)

ShadowRanger

Reputation: 155704

As others have noted, you can't read the request object twice (and can't read the file twice without seeking); once read, the data you got back is no longer available, so you need to store it.

But they missed another problem: You opened the file with mode w+. w+ allows both reading and writing, but, just like mode w, it truncates the file on open. So your local file is always empty when you read it, which means you're both corrupting the local file and never getting a match (unless the online file is empty too).

You need to use mode r+ or a+ to get a read/write handle that doesn't truncate the existing file (r+ requires that the file already exist, a+ does not, but puts the write position at end of file, and on some systems, all writes are put at the end of the file).
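
To see the difference between the modes without touching a real page, here is a minimal sketch. The scratch directory, the file contents, and the helper name read_with_mode are assumptions made up for the illustration. The helper seeks to the start before reading so that a+, which starts at end of file, can be compared with the others.

```python
import os
import tempfile


def read_with_mode(path, mode):
    """Open path with the given mode, rewind, and return what read() sees."""
    with open(path, mode) as f:
        f.seek(0)  # a+ starts at end of file; rewinding makes the modes comparable
        return f.read()


# Hypothetical scratch file standing in for offline.txt
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "offline.txt")
with open(path, "w") as f:
    f.write("<html>saved copy</html>")

seen_r_plus = read_with_mode(path, "r+")   # existing contents are preserved
seen_a_plus = read_with_mode(path, "a+")   # preserved as well
seen_w_plus = read_with_mode(path, "w+")   # file was truncated on open
size_after_w_plus = os.path.getsize(path)  # the data on disk is gone too

print(repr(seen_r_plus), repr(seen_a_plus), repr(seen_w_plus), size_after_w_plus)
```

The last open is the destructive one: after it, the saved copy no longer exists on disk, which is why the original script could never find a match against a non-empty page.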

So fixing both bugs, you get:

import urllib.request  # Python 3 spelling; urlopen lives in urllib.request there
url = "https://www.mywebsite.com/"
# Using with statements properly for safe resource cleanup
with urllib.request.urlopen(url) as onlinepage:
    onlinedata = onlinepage.read()  # read() returns bytes in Python 3
print(onlinedata)

# "rb+" DOES NOT TRUNCATE EXISTING FILE! Binary mode so both sides are bytes
with open("offline.txt", "rb+") as offlinepage:
    offlinedata = offlinepage.read()
    print(offlinedata)

    if onlinedata == offlinedata:
        print("same") # for debugging
    else:
        print("different")
        # I assume you want to rewrite the local page, or you wouldn't open with +
        # so this is what you'd do to ensure you replace the existing data correctly
        offlinepage.seek(0)     # Ensure you're seeked to beginning of file for write
        offlinepage.write(onlinedata)
        offlinepage.truncate()  # If online data smaller, don't keep offline extra data

Upvotes: 1

glibdud

Reputation: 7880

When you first print your online and offline pages with these lines:

print(onlinepage.read())
print(offlinepage.read())

...you have now consumed all of the text in each file object. Subsequent reads on either object will return an empty string. Two empty strings are equal, therefore your if condition will always evaluate to True.

If you were purely working with files, you could seek to the beginning of both files and read again. Since there is no seek method on the file object from urlopen, you'll need to either re-fetch the page with a new urlopen command or, better, save the original text in a variable and use that for your subsequent comparisons:

online = onlinepage.read()
print(online)
offline = offlinepage.read()
print(offline)

...

if online == offline:
    ...
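
The read-once behaviour can be reproduced without a network connection. This is only a sketch: io.BytesIO is an assumed stand-in for the object returned by urlopen, and the page contents are made up. Unlike the real response object, BytesIO supports seek, which is used at the end to show the rewind option that regular files have.

```python
import io

# Stand-in for the urlopen response (assumption: any file-like object behaves
# the same way once its data has been read)
onlinepage = io.BytesIO(b"<html>hello</html>")

first = onlinepage.read()    # the whole payload
second = onlinepage.read()   # b"": the object is already at end of data

# Store the data once, then compare as often as needed
online = first
offline = b"<html>hello</html>"  # pretend this was read from the local file
same = (online == offline)

# Seekable objects such as regular files can be rewound instead
onlinepage.seek(0)
again = onlinepage.read()

print(first, second, same, again)
```

Comparing second with the result of another exhausted read would always report equality, which is the false "same" from the question.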

Upvotes: 4

Sergii Shcherbak

Reputation: 987

You call .read() twice on each file object.

>>> f.read()
'This is the entire file.\n'
>>> f.read()
''

"If the end of the file has been reached, f.read() will return an empty string ("")." (Python tutorial, section 7.2.1, "Methods of File Objects").

Therefore, when two results are compared, they are equal because each is an empty string.

Upvotes: 0
