Reputation: 161
I have a piece of code that I am using to scrape data from various websites using the Jsoup library.
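// Uses org.jsoup.Jsoup, org.jsoup.Connection, org.jsoup.nodes.Document, org.jsoup.nodes.Element
Connection conn = Jsoup.connect(url);
try {
    Document doc = conn.get();
    Element element = doc.getElementById(elementId);
    if (element != null) { // getElementById returns null when no element has this id
        System.out.println(element.html());
    }
} catch (IOException e) {
    e.printStackTrace();
}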
The code works fine for most websites. However, it fails on one of the sites I am scraping, because the id of the HTML element I am interested in changes every time the page is refreshed. It looks as if a random number is appended to the end of the id.
Is this done purposefully to prevent people from scraping data? If so, what is the best way (if any) of getting around it?
Upvotes: 1
Views: 124
Reputation: 11712
First, you should not scrape websites that have not consented to it.
If you feel that your scraping is legitimate, I would look for things in the HTML that stay stable. That is not necessarily the id; class names are very often used in a similar and distinctive way.
In the case you describe, it sounds as if the base name of the id stays stable. So you could do this:
Element element = doc.select("[id^=baseID]").first();
This selects the first element whose id attribute starts with "baseID", or returns null if nothing matches. Look up CSS selectors in the Jsoup documentation to learn more.
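To try the prefix selector without hitting a live site, here is a minimal sketch that parses an inline HTML string. The class name IdPrefixDemo, the helper textByIdPrefix, and the sample id "price_48213" are made up for illustration; only Jsoup.parse, select, first, and text are Jsoup calls. It assumes the prefix contains no characters that are special in a CSS selector, such as ] or quotes.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class IdPrefixDemo {

    // Hypothetical helper: returns the text of the first element whose id
    // starts with idPrefix, or null if no element matches.
    public static String textByIdPrefix(String html, String idPrefix) {
        Document doc = Jsoup.parse(html);
        // [id^=value] matches elements whose id attribute begins with value
        Element element = doc.select("[id^=" + idPrefix + "]").first();
        return element == null ? null : element.text();
    }

    public static void main(String[] args) {
        // Sample markup: the numeric suffix stands in for the part that changes per refresh
        String html = "<div id=\"price_48213\">19.99</div><div id=\"other\">x</div>";
        System.out.println(textByIdPrefix(html, "price_"));
    }
}
```

If several elements share the same prefix, select returns all of them in document order, so you may need a more specific selector (for example combining the prefix with a tag or class name) rather than relying on first().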
Upvotes: 1