Reputation:
I am using Jsoup to parse data from a website. We are talking about a fairly large database with about 75000 entries. There are 19 categories, so 19 pages with thousands of entries each to parse. The problem is that Jsoup seems to be very slow at times: sometimes it cannot parse a single page within seconds, yet at other times it easily handles multiple pages per second. What exactly is causing this inconsistency, which also leads to the infamous "java.net.SocketTimeoutException: Read timed out"? My code:
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Parser {

    private int counter = 0;
    private FileWriter fw;
    private Elements elements;

    void parse(String category) throws IOException {
        try {
            // Fetch the category index page and collect all food-entry links
            Document doc = Jsoup.connect("https://fddb.info/db/de/produktgruppen/" + category + "/index.html").get();
            elements = doc.select("a[href^='https://fddb.info/db/de/lebensmittel']");
            File file = new File("Data/" + category + ".txt");
            fw = new FileWriter(file, true);
            writeToFile();
            fw.close();
        } catch (Exception e) {
            System.out.println("Timed out at " + counter);
            writeToFile();
        }
    }

    private void writeToFile() throws IOException {
        try {
            // Resume from the last successfully written entry
            for (int i = counter; i < elements.size(); i++) {
                Element element = elements.get(i);
                Document elementDoc = Jsoup.connect(element.attr("href")).get();
                // Headline
                fw.write(elementDoc.select("#fddb-headline1").text() + "\n");
                // Tags
                Elements tags = elementDoc.select("a[href='https://fddb.info/db/de/lexikon/gesundheitsthemen/index.html']");
                for (Element tag : tags) {
                    if (!tag.text().equals("Hinweis zu Gesundheitsthemen")) {
                        fw.write(tag.text() + "\n");
                    }
                }
                // Nutrition
                Elements nutritions = elementDoc.select("div[style*='padding:2px 4px']");
                for (Element nutrition : nutritions) {
                    fw.write(nutrition.text() + "\n");
                }
                counter++;
            }
        } catch (Exception e) {
            System.out.println("Timed out at " + counter);
            writeToFile(); // hacky retry, see note below
        }
    }
}
I have already tried to deal with the exception by simply calling the parsing logic again in the catch block, which is very hacky, I know...
Upvotes: 0
Views: 170
Reputation: 2941
I haven't tried your code because I don't know any valid category, but even without running it, the most obvious reason for this is server throttling. The server detects that you've downloaded a lot of data or executed many requests in a short period of time, so it decides to make you wait x minutes before you can continue. Sounds fair.
What can you do?
- Use a while loop to sleep x seconds and retry connections that failed because of a SocketTimeoutException, until the server lets you get the data again. This way, even if the server refuses a connection, your application will retry over and over. Don't spam it with requests; wait 30, 60 seconds or more between retries. (See the sketch after this list.)
- Use Jsoup.connect(...).userAgent(getRandomUserAgent()).get(); and implement a String getRandomUserAgent() method that returns a random one of a few strings you copied from https://developers.whatismybrowser.com/useragents/explore/software_name/chrome/. You will look like a few different users instead of the same one.
- Change the referrer by using .referrer(getRandomReferrer()) and implement a String getRandomReferrer() method to return a random one of a few URLs, for example https://fddb.info/db/de/suche/ or https://fddb.info/
.header("accept-encoding", "gzip, deflate")
Upvotes: 1