Daniel Winston

Reputation: 13

HTML DOM to Download Image from <img> URI

I have created a list of all the page URIs I'd like to download an image from, for a vehicle service manual.

The images are delivered via a PHP script, as can be seen here: http://www.atfinley.com/service/index.php?cat=g2&page=32

This is probably meant to deter behavior like mine; however, every Acura Legend owner shouldn't have to depend on a single host for their vehicle's manual.

I'd like to write a bot in JS/Java that visits every URL I've stored in this txt document, https://pastebin.com/yXdMJipq, and automates the download of the PNG available at each resource.

I'll eventually be creating a PDF of the manual and publishing it for open and free use.

If anyone has ideas for libraries I could use, or ways to approach the solution, please let me know. I am most fluent in Java.

I'm thinking a solution might be to fetch the HTML document at each URL and download the image from the <img> src attribute, roughly along the lines of the untested sketch below.
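
Roughly what I'm picturing, in plain Java (untested; the class name, the naive src extraction, and the output file name are just placeholders for illustration):

import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Scanner;

public class RoughIdea {
    public static void main(String[] args) throws Exception {
        // One page from the manual; the full list is in the pastebin txt file.
        String pageUrl = "http://www.atfinley.com/service/index.php?cat=g2&page=32";

        // Fetch the raw HTML of the page.
        String html;
        try (Scanner in = new Scanner(new URL(pageUrl).openStream(), "UTF-8")) {
            html = in.useDelimiter("\\A").next();
        }

        // Very naive extraction of the first src="..." value; assumes double quotes.
        int start = html.indexOf("src=\"") + 5;
        int end = html.indexOf('"', start);
        String imgUrl = new URL(new URL(pageUrl), html.substring(start, end)).toString();

        // Download the image bytes and save them (Java 9+ for readAllBytes).
        try (InputStream img = new URL(imgUrl).openStream()) {
            Files.write(Paths.get("page.png"), img.readAllBytes());
        }
    }
}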

Upvotes: 0

Views: 1725

Answers (4)

Daniel Winston

Reputation: 13

// Same outer skeleton as the Acura class in my other answer below; this part
// additionally needs import java.io.FileOutputStream; and import org.jsoup.Connection.Response;
Scanner read;
try {
    // One image URL per line, produced by the URL-scraping step.
    File list = new File("F:/imgurls.txt");
    read = new Scanner(list);
    int count = 0;

    while (read.hasNextLine()) {
        try {
            count++;
            String url = read.nextLine();
            // ignoreContentType(true) lets Jsoup fetch the PNG instead of rejecting
            // the non-HTML MIME type; maxBodySize(0) lifts the default response-size
            // cap so larger images aren't truncated.
            Response imageResponse = Jsoup.connect(url)
                    .ignoreContentType(true)
                    .maxBodySize(0)
                    .execute();
            FileOutputStream out = new FileOutputStream(new File("F:/Acura/" + count + ".png"));
            out.write(imageResponse.bodyAsBytes());
            out.close();
            // 2690 is the total number of URLs in the list.
            System.out.println((count * 100.0 / 2690) + "%");
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
    read.close();
} catch (FileNotFoundException e1) {
    e1.printStackTrace();
}

Worked for generating the PNGs.

Upvotes: 0

Daniel Winston

Reputation: 13

Finished solution for grabbing the image URLs:

import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileWriter;
import java.io.IOException;
import java.io.Writer;
import java.util.Scanner;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class Acura {

    public static void main(String[] args) throws IOException {

        Scanner read;
        Writer write;
        try {
            // result.txt holds one page URL per line (the pastebin list).
            File list = new File("F:/result.txt");
            read = new Scanner(list);
            write = new FileWriter("F:/imgurls.txt");
            int count = 0;

            while (read.hasNextLine()) {
                try {
                    count++;
                    String url = read.nextLine();
                    // Fetch the page and take its first <img>; absUrl("src") resolves
                    // the src against the page URL, so relative paths come out absolute.
                    Document doc = Jsoup.connect(url).get();
                    Element img = doc.select("img").first();
                    if (img != null) {
                        write.write(img.absUrl("src") + "\n");
                    }
                    // 2690 is the total number of URLs in the list.
                    System.out.println((count * 100.0 / 2690) + "%");
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
            read.close();
            write.close();
        } catch (FileNotFoundException e1) {
            e1.printStackTrace();
        }
    }
}

Generates a nice long list of image URLs in a text document. Could have done it in a non-sequential manner, but I was heavily intoxicated when I did this. However, I did add a progress readout for my own peace of mind :)

Upvotes: 0

tuberains

Reputation: 193

I have written something similar but unfortunately I can't find it anymore. Nevertheless, I remember using the Jsoup Java library, which comes in pretty handy.

It includes an HTTP client, and you can run CSS selectors on the document just like with jQuery...

This is the example from their frontpage:

Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
Elements newsHeadlines = doc.select("#mp-itn b a");

Creating PDFs is quite tricky, but I use Apache PDFBox for such things...
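
For the PDF step, a minimal sketch with PDFBox 2.x could look like the following (untested; the file names are placeholders). Repeating the page/image part in a loop would assemble the whole manual:

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageContentStream;
import org.apache.pdfbox.pdmodel.common.PDRectangle;
import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;

public class PngToPdf {
    public static void main(String[] args) throws Exception {
        try (PDDocument doc = new PDDocument()) {
            PDPage page = new PDPage(PDRectangle.LETTER);
            doc.addPage(page);

            // Load one previously downloaded PNG (placeholder path).
            PDImageXObject image = PDImageXObject.createFromFile("F:/Acura/1.png", doc);

            // Draw the image onto the page, stretched to the full page size here;
            // scale and position to taste.
            try (PDPageContentStream content = new PDPageContentStream(doc, page)) {
                content.drawImage(image, 0, 0,
                        page.getMediaBox().getWidth(), page.getMediaBox().getHeight());
            }

            doc.save("manual.pdf"); // placeholder output name
        }
    }
}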

Upvotes: 1

Nadav

Reputation: 1145

I know you asked for a JavaScript solution, but I believe PHP (which you also added as a tag) is more suitable for the task. Here are some guidelines to get you started:

  1. Move all the URLs into an array and create a foreach loop that iterates over it.
  2. Inside the loop, use the PHP Simple HTML DOM Parser to retrieve the image's src URL from each page.
  3. Still inside the loop, use the image URL in a cURL request to grab the file and save it into your custom folder. You can find the code required for this part here.

If this process proves to be too long and you get a PHP runtime error, consider storing the URLs generated in step 2 in a file, then using that file to build a new array and running step 3 on it as a separate process.

Upvotes: 1
