user13210907

Constant SocketTimeoutException using Jsoup

I am using Jsoup to parse data from a website. It is a fairly large database with about 75,000 entries spread across 19 categories, so there are 19 pages with thousands of entries each to parse. The problem is that Jsoup seems to be very slow at times: sometimes it cannot parse a single page within seconds, while at other times it easily handles multiple pages per second. What exactly is causing this inconsistency, which also leads to the infamous "java.net.SocketTimeoutException: Read timed out"? My code:

import java.io.File;
import java.io.FileWriter;
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Parser {
    private int counter = 0;
    private FileWriter fw;
    private Elements elements;

    void parse(String category) throws IOException {
        try {
            Document doc = Jsoup.connect("https://fddb.info/db/de/produktgruppen/" + category + "/index.html").get();
            elements = doc.select("a[href^='https://fddb.info/db/de/lebensmittel']");
            File file = new File("Data/" + category + ".txt");
            fw = new FileWriter(file, true);
            writeToFile();
            fw.close();
        } catch (Exception e) {
            System.out.println("Timed out at " + counter);
            writeToFile();
        }
    }

    private void writeToFile() throws IOException {
        try {
            for (int i = counter; i < elements.size(); i++) {
                Element element = elements.get(i);
                Document elementDoc = Jsoup.connect(element.attr("href")).get();
                // Headline
                fw.write(elementDoc.select("#fddb-headline1").text() + "\n");
                // Tags
                Elements tags = elementDoc.select("a[href='https://fddb.info/db/de/lexikon/gesundheitsthemen/index.html']");
                for (Element tag : tags) {
                    if (!tag.text().equals("Hinweis zu Gesundheitsthemen")) {
                        fw.write(tag.text() + "\n");
                    }
                }
                // Nutrition
                Elements nutritions = elementDoc.select("div[style*='padding:2px 4px']");
                for (Element nutrition : nutritions) {
                    fw.write(nutrition.text() + "\n");
                }
                counter++;
            }
        } catch (Exception e) {
            System.out.println("Timed out at " + counter);
            writeToFile();
        }
    }
}

I have already tried to deal with the exception by simply calling the parsing method again in the catch block. Very hacky, I know...

Upvotes: 0

Views: 170

Answers (1)

Krystian G

Reputation: 2941

I haven't tried your code because I don't know any valid category, but even without that, the most obvious reason for this is server throttling. The server detects that you have downloaded a lot of data or sent many requests in a short period of time, so it decides to make you wait x minutes before you can continue. Sounds fair.
What can you do?

  • You can wait x seconds between consecutive requests to slip under the radar so they won't detect you. Experiment with the waiting time; maybe 1 second will be enough, maybe 5, maybe 10...
  • You can also use a while loop to sleep x seconds and retry connections that failed with a SocketTimeoutException until the server lets you get the data again. This way your application will retry over and over even if the server refuses a connection. Don't spam them with requests; wait 30 seconds, 60 seconds, or more (see the retry sketch after this list).
  • Try randomizing the user agent with Jsoup.connect(...).userAgent(getRandomUserAgent()).get(); and implement a String getRandomUserAgent() method that returns one of a few strings you copied from https://developers.whatismybrowser.com/useragents/explore/software_name/chrome/ You will look like a few different users instead of the same one.
  • Another thing you can do to look like a different user and prevent potential blocking is randomizing the referrer by using .referrer(getRandomReferrer()) and implementing a String getRandomReferrer() method that returns one of a few URLs, for example https://fddb.info/db/de/suche/ or https://fddb.info/
  • Generally, it's a good tip to ask the server for a compressed response so the data size is smaller and the download is faster: use .header("accept-encoding", "gzip, deflate"). Sketches of these helpers follow below.
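
Putting the first two points together, here is a minimal sketch of a retry helper. The fetchWithRetry name, the delay values, and the retry limit are my own assumptions; experiment to find what the server tolerates.

import java.io.IOException;
import java.net.SocketTimeoutException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ThrottledFetcher {
    // Assumed values, not something the server documents; tune by experiment.
    private static final long DELAY_BETWEEN_REQUESTS_MS = 2000;
    private static final long DELAY_AFTER_TIMEOUT_MS = 60000;
    private static final int MAX_RETRIES = 5;

    // Fetches a page, pausing before every request and backing off after timeouts.
    public static Document fetchWithRetry(String url) throws IOException, InterruptedException {
        for (int attempt = 1; attempt <= MAX_RETRIES; attempt++) {
            try {
                Thread.sleep(DELAY_BETWEEN_REQUESTS_MS); // stay under the radar
                return Jsoup.connect(url).get();
            } catch (SocketTimeoutException e) {
                // The server is probably throttling us: wait longer, then retry.
                System.out.println("Timed out on " + url + ", retry " + attempt);
                Thread.sleep(DELAY_AFTER_TIMEOUT_MS);
            }
        }
        throw new IOException("Giving up on " + url + " after " + MAX_RETRIES + " timeouts");
    }
}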

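And a sketch of the randomization helpers from the last three points. The user agent strings below are just examples; copy current ones from the whatismybrowser.com page linked above.

import java.io.IOException;
import java.util.Random;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class RequestRandomizer {
    private static final Random RANDOM = new Random();

    // Example strings only; replace with real ones from whatismybrowser.com.
    private static final String[] USER_AGENTS = {
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36"
    };

    private static final String[] REFERRERS = {
        "https://fddb.info/db/de/suche/",
        "https://fddb.info/"
    };

    public static String getRandomUserAgent() {
        return USER_AGENTS[RANDOM.nextInt(USER_AGENTS.length)];
    }

    public static String getRandomReferrer() {
        return REFERRERS[RANDOM.nextInt(REFERRERS.length)];
    }

    // Usage: all three tips combined in a single connection call.
    public static Document fetch(String url) throws IOException {
        return Jsoup.connect(url)
                .userAgent(getRandomUserAgent())
                .referrer(getRandomReferrer())
                .header("accept-encoding", "gzip, deflate")
                .get();
    }
}
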
Upvotes: 1
