MildTomato

Reputation: 71

Use full URL as saved file name with wget

I'm using wget in terminal to download a large list of images.

example: $ wget -i images.txt

I have all the image URLs in the images.txt file.

However, the image URLs tend to be like example.com/uniqueNumber/images/main_250.jpg,

which means that all the images come out named main_250.jpg

What I really need is for each image to be saved with its entire URL as the filename, so that the 'unique number' is part of each name.

Any suggestions?

Upvotes: 3

Views: 4060

Answers (3)

Ole Tange

Reputation: 33685

With GNU Parallel you can do:

cat images.txt | parallel wget -O '{= s:/:-:g; =}' {}
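
The {= s:/:-:g; =} part is a Perl expression that GNU Parallel applies to the input line, replacing every slash with a dash, so each file is saved under its full, flattened URL. A quick sketch of the effect, using a made-up URL shaped like the one in the question:

    # the -O argument becomes http:--example.com-12345-images-main_250.jpg
    echo 'http://example.com/12345/images/main_250.jpg' | parallel wget -O '{= s:/:-:g; =}' {}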

Upvotes: 3

imrek

Reputation: 3040

I have a not-so-elegant solution that may not work everywhere.

You probably know that if your URL ends in a query string, wget will use that query in the filename. For example, if you have http://domain/page?q=blabla, you will get a file called page?q=blabla after the download. Usually this is annoying, but you can turn it to your advantage.

Suppose you wanted to download some index.html pages and keep track of their origin, as well as avoid ending up with index.html, index.html.1, index.html.2, etc. in your download folder. Your input file urls.txt may look something like the following:

https://google.com/
https://bing.com/
https://duckduckgo.com/

If you call wget -i urls.txt, you end up with those numbered index.html files. But if you "doctor" your URLs with a fake query, you get useful filenames.

Write a script that appends each URL as a query to itself (a one-line sed will do; see the sketch after the list), e.g.

https://google.com/?url=https://google.com/
https://bing.com/?url=https://bing.com/
https://duckduckgo.com/?url=https://duckduckgo.com/
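
A minimal sketch of that doctoring step, using GNU sed's -i to rewrite urls.txt in place (this assumes the URLs contain no existing query string):

    # append each line to itself as a ?url= query
    sed -i 's#.*#&?url=&#' urls.txt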

The doctored URLs look cheesy, right? But if you now execute wget -i urls.txt, you get the following files:

index.html?url=https:%2F%2Fbing.com%2F
index.html?url=https:%2F%2Fduckduckgo.com%2F
index.html?url=https:%2F%2Fgoogle.com%2F

instead of nondescript numbered index.html files. Sure, they look ugly, but you can clean up the filenames (a sketch follows below), and voilà! Each file will have its origin as its name.
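
One possible cleanup pass, assuming the %2F-encoded names shown above; the substitutions below are only an illustration, not a general URL decoder:

    # strip the index.html?url= prefix and replace %2F, : and ? with dashes,
    # e.g. index.html?url=https:%2F%2Fgoogle.com%2F becomes https---google.com-
    for f in 'index.html?url='*; do
        mv -- "$f" "$(printf '%s' "${f#index.html?url=}" | sed 's/%2F/-/g; s/[:?]/-/g')"
    done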

The approach probably has some limitations, e.g. if the site you are downloading from actually parses and acts on the query parameters.

Otherwise, you'll have to solve the filename/source-URL problem outside of wget, either with a bash script or in another programming language.

Upvotes: 0

pinoaffe

Reputation: 51

Presuming the URLs for the images are in a text file named images.txt, with one URL per line, you can run

    cat images.txt | sed 'p;s/\//-/g' | sed 'N;s/\n/ -O /' | xargs -L 1 wget

to download each image with a filename formed out of its URL (the -L 1 makes xargs run a separate wget per line; without it, all the URL/-O pairs would be passed to a single wget invocation, where the -O options would collide).

Now for the explanation:

In this example I'll use

    https://www.newton.ac.uk/files/covers/968361.jpg
    https://www.moooi.com/sites/default/files/styles/large/public/product-images/random_detail.jpg?itok=ErJveZTY

as images.txt (you can add as many images as you like to your file, as long as they are in this same format).

  • cat images.txt writes the content of the file to standard output, feeding the pipeline
  • sed 'p;s/\//-/g' prints each URL to stdout followed by the intended filename (the URL with every slash replaced by a dash) on the next line, like so:

    https://www.newton.ac.uk/files/covers/968361.jpg
    https:--www.newton.ac.uk-files-covers-968361.jpg
    https://www.moooi.com/sites/default/files/styles/large/public/product-images/random_detail.jpg?itok=ErJveZTY
    https:--www.moooi.com-sites-default-files-styles-large-public-product-images-random_detail.jpg?itok=ErJveZTY
  • sed 'N;s/\n/ -O /' joins each pair of lines (the URL and the intended filename) into one line, inserting the -O option in between (this tells wget that the second argument is the intended output filename); the result looks like this:

    https://www.newton.ac.uk/files/covers/968361.jpg -O https:--www.newton.ac.uk-files-covers-968361.jpg
    https://www.moooi.com/sites/default/files/styles/large/public/product-images/random_detail.jpg?itok=ErJveZTY -O https:--www.moooi.com-sites-default-files-styles-large-public-product-images-random_detail.jpg?itok=ErJveZTY
  • and finally xargs -L 1 wget runs wget once per line, so each URL is downloaded to its intended filename; the end result in this example is two images in the current directory, named https:--www.newton.ac.uk-files-covers-968361.jpg and https:--www.moooi.com-sites-default-files-styles-large-public-product-images-random_detail.jpg?itok=ErJveZTY respectively.
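
If the sed/xargs chain feels opaque, a plain while-read loop achieves the same result one URL at a time (a sketch under the same one-URL-per-line assumption):

    # flatten each URL into a filename, then fetch it; one wget per URL
    while IFS= read -r url; do
        wget -O "$(printf '%s' "$url" | tr '/' '-')" "$url"
    done < images.txt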

Upvotes: 5
