Rick

Reputation: 11

Scrape Entire Website for Image URLs Only

A client has retained me to collect a list of images on a website. The database is a huge mess, and images are stored all over the place (some in S3, some on the local server). I need to produce a list of images that we will migrate from S3 to the new hosting company that we are moving the website to.

I've tried searching the database dump with REGEXP, but the image list I come up with does not match what the site is actually using.

What I'm looking to do: unleash a Python script to crawl the entire website for all image URLs. The website is WordPress, so there will be a lot of .jpg?8127 and such going on. I don't care about those; I can clean up the output later.

So, my objectives are:

- Write a Python script that follows every link on the website and parses the output for image links.
- Dump the results into a text file for cleanup and review.
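The two objectives above can be sketched with nothing but the standard library. This is only a rough illustration, not a finished tool: the names `LinkAndImageParser`, `fetch`, `crawl` and `save_urls` are made up for the example, and it only looks at `<img src>` attributes, so `srcset` values and CSS background images would be missed.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class LinkAndImageParser(HTMLParser):
    """Collects <a href> and <img src> values from one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.images = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag == "img" and attrs.get("src"):
            self.images.append(attrs["src"])


def fetch(url):
    """Return the page's HTML as text, or an empty string on any failure."""
    try:
        with urlopen(url, timeout=10) as resp:
            if "html" not in resp.headers.get("Content-Type", ""):
                return ""
            return resp.read().decode("utf-8", errors="replace")
    except (OSError, ValueError):
        return ""


def crawl(start_url, fetch_page=fetch, max_pages=1000):
    """Breadth-first crawl of one site; returns a sorted list of image URLs."""
    site = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([start_url])
    images = set()
    pages = 0
    while queue and pages < max_pages:
        page = queue.popleft()
        pages += 1
        parser = LinkAndImageParser()
        parser.feed(fetch_page(page))
        # Image URLs are kept even when they point off-site (e.g. to S3).
        for src in parser.images:
            images.add(urljoin(page, src))
        # Only links on the same host are followed; fragments are dropped.
        for href in parser.links:
            link = urldefrag(urljoin(page, href)).url
            if urlparse(link).netloc == site and link not in seen:
                seen.add(link)
                queue.append(link)
    return sorted(images)


def save_urls(urls, path):
    """Write one URL per line to a text file for later cleanup."""
    with open(path, "w", encoding="utf-8") as out:
        out.write("\n".join(urls) + "\n")
```

Something like `save_urls(crawl("https://example.com/"), "image_urls.txt")` would then produce the text file, with `example.com` standing in for the real site. Query strings such as `?8127` are left in place for the cleanup pass.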

I am looking at using https://pypi.python.org/pypi/ImageScraper as part of this since it seems to make the most sense.

How might I best go about this?

Upvotes: 1

Views: 1824

Answers (1)

Alisher Gafurov

Reputation: 447

I think you should look at the Scrapy project. With Scrapy you can write the crawler and use an item pipeline to save either the images or the URLs of the images.

Upvotes: 1
