Ahmed

Reputation: 119

Getting IOException: Premature EOF when running an import.io crawler

I have created a crawler using import.io. The first issue I faced was that import.io could not identify the data on the webpage after I clicked "Detect Optimal Settings". It asks "is the data you want to extract still in the browser?" Since the data is not highlighted, I click No, but even then the data is still not highlighted. The same thing happens with the extractor. I worked around this by clicking Yes when it asked "is the data you want to extract still in the browser?", even though the data was not highlighted. I then built the crawler, and it works fine. I put around 15K URLs in the start URLs, with page depth 0.

The problem is that, out of the 15K pages, around 10% are not crawled. I checked the log file, and it shows IOException: Premature EOF against the rows that were not crawled.

If I manually open one of those pages in a browser, it loads fine and is in the same format as the pages I trained the crawler on. I even tried training the crawler on the pages that showed this error, but that doesn't help.

How can I get around this error?

Upvotes: 3

Views: 124

Answers (1)

Meg Ainsley

Reputation: 436

I responded to your support ticket, but thought it would be good to put that information here as well. This error is most likely caused by the website detecting that you are using a crawler and blocking the requests for those URLs. Since you are crawling so many pages, I would suggest rerunning the crawler with an increased "pause between pages" setting so that the site does not block you.
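
For anyone who wants to see the idea outside the import.io UI, here is a minimal Python sketch of "pause between pages" combined with retries. It is not import.io's implementation; the names crawl_with_pause, fetch, pause_seconds and max_retries are hypothetical and chosen only for illustration. It assumes a dropped connection surfaces as an OSError, which is only a rough analogue of the Java IOException in the log.

```python
import time
from typing import Callable, Dict, Iterable, List, Tuple


def crawl_with_pause(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    pause_seconds: float = 2.0,
    max_retries: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[Dict[str, str], List[str]]:
    """Fetch each URL in turn, pausing between pages and retrying failures.

    `fetch` is a caller-supplied function (hypothetical here) that returns
    the page body or raises OSError when the connection is cut short.
    Returns the fetched pages and the list of URLs that never succeeded.
    """
    pages: Dict[str, str] = {}
    failed: List[str] = []
    for url in urls:
        for attempt in range(max_retries):
            try:
                pages[url] = fetch(url)
                break
            except OSError:
                # Wait longer after each failed attempt before trying again.
                sleep(pause_seconds * (2 ** attempt))
        else:
            # Every attempt failed; record the URL so it can be rerun later.
            failed.append(url)
        # The "pause between pages": wait before requesting the next URL.
        sleep(pause_seconds)
    return pages, failed
```

The list of failed URLs can then be fed into a second run with a larger pause, which is the same approach as rerunning only the rows that showed the error in the log.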

Upvotes: 1
