user1443778

Reputation: 591

Store raw HTML content from a URL, then get an InputStream from memory (not using a connection)

I have a somewhat tricky multi-threading problem. I use a thread pool (ExecutorService) whose tasks open connections and put them in a LinkedBlockingQueue.

So far I have used:

// run method in the "getter" threads
public void run() {
    try {
        URL url = new URL(url_s); // url_s is given as a constructor argument

        // if I am correct, url.openStream() will wait until we have the content
        InputStream stream = url.openStream();

        Request req = new Request(); // a class with two variables:
        req.html_stream = new InputSource(stream);
        req.source = stream;

        // this is a class variable (LinkedBlockingQueue<Request>)
        blocking_queue.put(req);
    } catch (Exception ex) {
        logger.info("Getter thread died from an exception", ex);
        return;
    }
}

I then have a consumer thread (java.lang.Thread) that takes these InputSources and InputStreams and does:

public void run() {
    while (running) {
        try {
            logger.info("waiting for data to eat");
            Request req = blocking_queue.take();
            if (req.html_stream != null) {
                eat_data(req);
            }
        } catch (Exception ex) {
            logger.error(ex);
            return;
        }
    }
}

Here eat_data calls an external library that takes an InputSource. The library uses a singleton instance to do the processing, so I can't put this step in the "getter" threads.

When I tested this code with small amounts of data it worked fine, but when I supplied it with several thousand URLs I started having real problems. It's not easy to find out exactly what is wrong, but I suspect that the connections time out before the consumer thread gets to them, sometimes even causing deadlock.

I implemented it this way because it was so easy to go from url.openStream() to an InputSource, but I realize that I really must store the data locally for this to work.

How do I get from url.openStream() to some object I can store in my LinkedBlockingQueue (all data in memory) that I can later turn into an InputSource when my consumer thread has time to process it?

Upvotes: 2

Views: 641

Answers (1)

Sripathi Krishnan

Reputation: 31528

You can copy the contents of the URL into a ByteArrayOutputStream and close the URL stream. Then store a ByteArrayInputStream built from those bytes in the queue.

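Sketch (using Commons IO):

// Read the whole response into memory, then release the connection.
byte[] data;
try (InputStream in = url.openStream()) {
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    IOUtils.copy(in, buffer); // org.apache.commons.io.IOUtils
    data = buffer.toByteArray();
}

// The queued stream is backed by memory only; no connection stays open.
ByteArrayInputStream bin = new ByteArrayInputStream(data);
queue.put(bin);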

References :

  1. java.io.ByteArrayInputStream
  2. java.io.ByteArrayOutputStream
  3. org.apache.commons.io.IOUtils
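
If you would rather not depend on Commons IO, the same idea can be sketched with the JDK alone. This is only an illustration under a few assumptions: the class name InMemoryFetch and the helpers readFully, fetch and toInputSource are made up for this example, and the queue here holds plain byte[] values (a variation on queuing a ByteArrayInputStream) so that the consumer can wrap them in an InputSource only when it is ready to parse.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

import org.xml.sax.InputSource;

// Hypothetical helper class, not part of the original code.
public class InMemoryFetch {

    /** Drains a stream into a byte array using only JDK classes. */
    static byte[] readFully(InputStream in) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        return buffer.toByteArray();
    }

    /** Getter side: download everything, close the stream, then queue the bytes. */
    static void fetch(URL url, BlockingQueue<byte[]> queue)
            throws IOException, InterruptedException {
        byte[] body;
        try (InputStream in = url.openStream()) {
            body = readFully(in);
        }
        // The stream is already closed here; only memory is handed over.
        queue.put(body);
    }

    /** Consumer side: wrap the stored bytes when there is time to parse them. */
    static InputSource toInputSource(byte[] body) {
        return new InputSource(new ByteArrayInputStream(body));
    }

    public static void main(String[] args) throws Exception {
        BlockingQueue<byte[]> queue = new LinkedBlockingQueue<>();
        // Stand-in for a downloaded page, so the demo needs no network access.
        byte[] page = "<html></html>".getBytes("UTF-8");
        queue.put(readFully(new ByteArrayInputStream(page)));

        InputSource source = toInputSource(queue.take());
        System.out.println("bytes ready to parse: " + source.getByteStream().available());
    }
}
```

Since every page now sits in memory until the consumer takes it, it may be worth giving the LinkedBlockingQueue a capacity (its constructor accepts one), so that put() blocks the getter threads when the consumer falls behind instead of letting the queue grow without limit.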

Upvotes: 2
