Reputation: 71
I'm using wget in the terminal to download a large list of images.
example — $ wget -i images.txt
I have all the image URLs in the images.txt file.
However, the image URLs tend to look like example.com/uniqueNumber/images/main_250.jpg
which means that all the images come out named main_250.jpg
What I really need is for each image to be saved with its entire URL as the filename, so that the 'unique number' is part of it.
Any suggestions?
Upvotes: 3
Views: 4060
Reputation: 33685
With GNU Parallel you can do:
cat images.txt | parallel wget -O '{= s:/:-:g; =}' {}
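Here {= s:/:-:g; =} is a Perl expression that replaces every / in the URL with -, so the whole URL becomes the output filename. As a rough illustration (the URL below just follows the pattern from the question), you can preview the generated commands with --dry-run:
cat images.txt | parallel --dry-run wget -O '{= s:/:-:g; =}' {}
# for a line like https://example.com/uniqueNumber/images/main_250.jpg this prints roughly:
# wget -O https:--example.com-uniqueNumber-images-main_250.jpg https://example.com/uniqueNumber/images/main_250.jpg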
Upvotes: 3
Reputation: 3040
I have a not-so-elegant solution that may not work everywhere.
You probably know that if your URL ends in a query, wget will use that query in the filename. E.g. if you have http://domain/page?q=blabla, you will get a file called page?q=blabla after download. Usually, this is annoying, but you can turn it to your advantage.
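For instance (domain and path are just the placeholder from the sentence above, not a real site):
wget 'http://domain/page?q=blabla'
# if the server answers, the file is saved in the current directory as: page?q=blabla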
Suppose you wanted to download some index.html pages and wanted to keep track of their origin, as well as avoid ending up with index.html, index.html.1, index.html.2, etc. in your download folder. Your input file urls.txt may look something like the following:
https://google.com/
https://bing.com/
https://duckduckgo.com/
If you call wget -i urls.txt, you end up with those numbered index.html files. But if you "doctor" your URLs with a fake query, you get useful file names.
Write a script that appends each URL as a query to itself (a one-line sketch follows the example below), e.g.
https://google.com/?url=https://google.com/
https://bing.com/?url=https://bing.com/
https://duckduckgo.com/?url=https://duckduckgo.com/
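One way to produce these doctored URLs (a minimal sketch; assumes GNU sed so that -i can edit urls.txt in place) is a substitution that repeats each line after a ?url= marker:
sed -i 's|.*|&?url=&|' urls.txt
# & in the replacement stands for the whole matched line, i.e. the original URL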
Looks cheesy, right? But if you now execute wget -i urls.txt, you get the following files:
index.html?url=https:%2F%2Fbing.com%2F
index.html?url=https:%2F%2Fduckduckgo.com%2F
index.html?url=https:%2F%2Fgoogle.com%2F
instead of nondescript numbered index.html files. Sure, they look ugly, but you can clean up the filenames, and voilà! Each file will have its origin as its name.
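The cleanup itself can be as simple as a rename loop, for example this hypothetical sketch that strips the fake-query prefix and replaces the encoded slashes (a literal / cannot appear in a filename):
for f in index.html\?url=*; do
    name=${f#*url=}         # drop the index.html?url= prefix
    name=${name//%2F/_}     # %2F is an encoded /, so substitute something filename-safe
    mv -- "$f" "$name"
done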
The approach probably has some limitations, e.g. if the site you are downloading from actually executes the query and parses the parameters.
Otherwise, you'll have to solve the file name/source URL problem outside of wget, either with a bash script or in other programming languages.
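For the bash-script route, a minimal sketch (assuming urls.txt holds one URL per line) could look like this:
while IFS= read -r url; do
    name=${url//\//_}        # build a filename by replacing every / in the URL with _
    wget -O "$name" "$url"   # save the download under that URL-derived name
done < urls.txt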
Upvotes: 0
Reputation: 51
Presuming the URLs for the images are in a text file named images.txt with one URL per line, you can run
cat images.txt | sed 'p;s/\//-/g' | sed 'N;s/\n/ -O /' | xargs -L 1 wget
to download each and every image with a filename formed out of its URL. (The -L 1 makes xargs run one wget per line; without it, all the URL/-O pairs would be handed to a single wget call, which only honours the last -O.)
Now for the explanation:
in this example I'll use https://www.newton.ac.uk/files/covers/968361.jpg
https://www.moooi.com/sites/default/files/styles/large/public/product-images/random_detail.jpg?itok=ErJveZTY
as images.txt (you can add as many images as you like to your file, as long as they are in this same format).
cat images.txt pipes the content of the file to standard output.
sed 'p;s/\//-/g' prints each URL on one line and the intended filename (the URL with every / replaced by -) on the next line, like so:
https://www.newton.ac.uk/files/covers/968361.jpg
https:--www.newton.ac.uk-files-covers-968361.jpg
https://www.moooi.com/sites/default/files/styles/large/public/product-images/random_detail.jpg?itok=ErJveZTY
https:--www.moooi.com-sites-default-files-styles-large-public-product-images-random_detail.jpg?itok=ErJveZTY
sed 'N;s/\n/ -O /' combines the two lines of each image (the URL and the intended filename) into one line and adds the -O option in between (this is how wget knows that the second argument is the intended filename). The result for this part looks like this:
https://www.newton.ac.uk/files/covers/968361.jpg -O https:--www.newton.ac.uk-files-covers-968361.jpg
https://www.moooi.com/sites/default/files/styles/large/public/product-images/random_detail.jpg?itok=ErJveZTY -O https:--www.moooi.com-sites-default-files-styles-large-public-product-images-random_detail.jpg?itok=ErJveZTY
xargs -L 1 wget runs wget once per line, passing that line as wget's arguments. The end result in this example is two images in the current directory named https:--www.newton.ac.uk-files-covers-968361.jpg and https:--www.moooi.com-sites-default-files-styles-large-public-product-images-random_detail.jpg?itok=ErJveZTY respectively.
Upvotes: 5