Jay

Reputation: 629

Clone a single webpage (with images) and save to index.html

I want to clone a single webpage with all of its images and no external links in the HTML. I can achieve this with wget -E -H -k -K -p {url}, however this pulls down the page with its full directory structure, and you have to navigate into that structure to find the HTML file. This makes the location of the HTML file inconsistent.

I can also do this: wget --no-check-certificate -O index.html -c -k {url}. However, this keeps the image links pointing at the web, so the page isn't truly local; it still has to go out to the web to display properly.

Is there any way to clone a single webpage and spit out an index.html with the images linked locally?

PS: I am using wget through a Python script that makes changes to webpages, so having an index.html is necessary for me. I am interested in other methods if there are better ones.

EDIT:

It seems I haven't explained myself well, so here is some background on this project. I am working on a proof of concept for school: an automated phishing script that clones a webpage, modifies a few action tags, and places the result on a local web server so that a user can navigate to it and the page displays correctly. Previously, using -O worked fine for me, but since I am now incorporating DNS spoofing into the project, the webpage can't have any links pointing externally, as they would just end up rerouted to my internal web server and the page would look broken. That is why I need only the information necessary for the single webpage to display correctly, and I need the output location to be predictable, so that when I navigate to the directory I cloned the website into, the page displays with working links to images, CSS, etc.
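One combination that seems to address both problems (hedged, since the saved filename with --no-directories still depends on the URL): wget's --no-directories flag drops the host/path hierarchy so every page requisite lands in one flat directory, and --convert-links then rewrites the page to reference those flat local names. Invoked from the Python script it could look something like this; the "clone" directory name and the URL are placeholders, not from the question:

    # Sketch: run wget from Python so the cloned page and its requisites land
    # in one flat, predictable directory. All flags are standard wget options;
    # the "clone" directory name is an assumption.
    import subprocess

    url = "https://example.com/"  # placeholder target

    subprocess.run([
        "wget",
        "--page-requisites",       # -p: fetch images, CSS, JS needed to render the page
        "--convert-links",         # -k: rewrite references to point at the local copies
        "--adjust-extension",      # -E: save the page with an .html suffix
        "--span-hosts",            # -H: allow requisites hosted on other domains (CDNs)
        "--no-directories",        # -nd: no host/path hierarchy, one flat directory
        "--directory-prefix", "clone",  # -P: where that flat directory goes
        url,
    ], check=True)

With --no-directories the page itself is still named after the URL (a URL ending in / typically yields index.html), so a rename or a glob for the single .html file may be needed afterwards.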

Upvotes: 1

Views: 3171

Answers (2)

Harshith Thota

Reputation: 864

wget is a command-line tool. There's no point in invoking it through Python when you can achieve this task directly in Python. Basically, what you're trying to build is a web scraper. Use the requests and BeautifulSoup modules to achieve this. Research them a bit and start writing a script; if you hit any errors, feel free to post a new question about it on SO.
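A minimal sketch of that approach, assuming the images are referenced through src attributes and skipping error handling (the URL and output directory are placeholders, not from the answer):

    # Sketch: fetch one page, download its images, rewrite the <img> tags to
    # point at the local copies, and write the result out as index.html.
    # Requires the requests and beautifulsoup4 packages.
    import os
    from urllib.parse import urljoin, urlparse

    import requests
    from bs4 import BeautifulSoup

    url = "https://example.com/"   # placeholder target
    out_dir = "clone"              # assumed output directory
    os.makedirs(out_dir, exist_ok=True)

    soup = BeautifulSoup(requests.get(url).text, "html.parser")

    for i, img in enumerate(soup.find_all("img", src=True)):
        img_url = urljoin(url, img["src"])   # resolve relative URLs against the page
        name = os.path.basename(urlparse(img_url).path) or f"img{i}"
        with open(os.path.join(out_dir, name), "wb") as f:
            f.write(requests.get(img_url).content)
        img["src"] = name                    # point the tag at the local file

    with open(os.path.join(out_dir, "index.html"), "w", encoding="utf-8") as f:
        f.write(str(soup))

The same loop pattern extends to <link href=...> and <script src=...> tags, which a fully local clone would also need to rewrite.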

Upvotes: 0

Mauricio Cortazar

Reputation: 4213

Use this: wget facebook.com --domains website.org --no-parent --page-requisites --html-extension --convert-links. If you want to download the entire website, add --recursive after the URL.
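Since the question runs wget from a Python script anyway, the same command could be wrapped like this (a sketch; the URL and domain are the placeholders from the answer above):

    # Sketch: the answer's wget command invoked via subprocess.
    # Append "--recursive" to the list to mirror the whole site instead of one page.
    import subprocess

    subprocess.run([
        "wget", "https://facebook.com",
        "--domains", "website.org",  # only follow links within this domain
        "--no-parent",               # never ascend above the starting directory
        "--page-requisites",         # fetch the images, CSS, and scripts the page needs
        "--html-extension",          # save pages with an .html suffix
        "--convert-links",           # rewrite links to point at the local copies
    ], check=True)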

Upvotes: 2
