Reputation: 21
I'm trying to extract data from a web page. For example, let's say I want to fetch information from chess.org.il.
I know the player's ID is 25022, which means I can request http://www.chess.org.il/Players/Player.aspx?Id=25022
On that page I can see that this player's FIDE ID is 2821109.
From that, I can request this page:
http://ratings.fide.com/card.phtml?event=2821109
And from that I can see that stdRating=1602.
How can I get the "stdRating" output from a given "localID" input in Java?
(localID, fideID and stdRating are just placeholder names that I use to clarify the question.)
Upvotes: 1
Views: 2017
Reputation: 6289
You could try the univocity-html-parser, which is very easy to use and avoids a lot of spaghetti code.
To get the standard rating, for example, you can use this code:
public static void main(String... args) {
    UrlReaderProvider url = new UrlReaderProvider("http://ratings.fide.com/card.phtml?event={EVENT}");
    url.getRequest().setUrlParameter("EVENT", 2821109);
    HtmlElement doc = HtmlParser.parseTree(url);
    String rating = doc.query()
            .match("small").withText("std.")
            .match("br").getFollowingText()
            .getValue();
    System.out.println(rating);
}
Which produces the value 1602.
But getting data by querying individual nodes and trying to stitch all pieces together is not exactly easy.
I expanded the code to illustrate how you can use the parser to get more information into records. Here I created records for the player and her rank details which are available in the table of the second page. It took me less than 1h to get this done:
public static void main(String... args) {
    UrlReaderProvider url = new UrlReaderProvider("http://www.chess.org.il/Players/Player.aspx?Id={PLAYER_ID}");
    url.getRequest().setUrlParameter("PLAYER_ID", 25022);

    HtmlEntityList entities = new HtmlEntityList();
    HtmlEntitySettings player = entities.configureEntity("player");
    player.addField("id").match("b").withExactText("מספר שחקן").getFollowingText().transform(s -> s.replaceAll(": ", ""));
    player.addField("name").match("h1").followedImmediatelyBy("b").withExactText("מספר שחקן").getText();
    player.addField("date_of_birth").match("b").withExactText("תאריך לידה:").getFollowingText();
    player.addField("fide_id").matchFirst("a").attribute("href", "http://ratings.fide.com/card.phtml?event=*").getText();

    HtmlLinkFollower playerCard = player.addField("fide_card_url").matchFirst("a").attribute("href", "http://ratings.fide.com/card.phtml?event=*").getAttribute("href").followLink();
    playerCard.addField("rating_std").match("small").withText("std.").match("br").getFollowingText();
    playerCard.addField("rating_rapid").match("small").withExactText("rapid").match("br").getFollowingText();
    playerCard.addField("rating_blitz").match("small").withExactText("blitz").match("br").getFollowingText();
    playerCard.setNesting(Nesting.REPLACE_JOIN);

    HtmlEntitySettings ratings = playerCard.addEntity("ratings");
    configureRatingsBetween(ratings, "World Rank", "National Rank ISR", "world");
    configureRatingsBetween(ratings, "National Rank ISR", "Continent Rank Europe", "country");
    configureRatingsBetween(ratings, "Continent Rank Europe", "Rating Chart", "continent");

    Results<HtmlParserResult> results = new HtmlParser(entities).parse(url);
    HtmlParserResult playerData = results.get("player");
    String[] playerFields = playerData.getHeaders();

    for (HtmlRecord playerRecord : playerData.iterateRecords()) {
        for (int i = 0; i < playerFields.length; i++) {
            System.out.print(playerFields[i] + ": " + playerRecord.getString(playerFields[i]) + "; ");
        }
        System.out.println();

        HtmlParserResult ratingData = playerRecord.getLinkedEntityData().get("ratings");
        for (HtmlRecord ratingRecord : ratingData.iterateRecords()) {
            System.out.print(" * " + ratingRecord.getString("rank_type") + ": ");
            System.out.println(ratingRecord.fillFieldMap(new LinkedHashMap<>(), "all_players", "active_players", "female", "u16", "female_u16"));
        }
    }
}

private static void configureRatingsBetween(HtmlEntitySettings ratings, String startingHeader, String endingHeader, String rankType) {
    Group group = ratings.newGroup()
            .startAt("table").match("b").withExactText(startingHeader)
            .endAt("b").withExactText(endingHeader);

    group.addField("rank_type", rankType);
    group.addField("all_players").match("tr").withText("World (all", "National (all", "Rank (all").match("td", 2).getText();
    group.addField("active_players").match("tr").followedImmediatelyBy("tr").withText("Female (active players):").match("td", 2).getText();
    group.addField("female").match("tr").withText("Female (active players):").match("td", 2).getText();
    group.addField("u16").match("tr").withText("U-16 Rank (active players):").match("td", 2).getText();
    group.addField("female_u16").match("tr").withText("Female U-16 Rank (active players):").match("td", 2).getText();
}
The output will be:
id: 25022; name: יעל כהן; date_of_birth: 02/02/2003; fide_id: 2821109; rating_std: 1602; rating_rapid: 1422; rating_blitz: 1526;
* world: {all_players=195907, active_players=94013, female=5490, u16=3824, female_u16=586}
* country: {all_players=1595, active_players=1024, female=44, u16=51, female_u16=3}
* continent: {all_players=139963, active_players=71160, female=3757, u16=2582, female_u16=372}
Hope it helps
Disclosure: I'm the author of this library. It's commercial closed source but it can save you a lot of development time.
Upvotes: 2
Reputation: 2220
As @Alex R pointed out, you'll need a web scraping library for this.
The one he recommended, JSoup, is robust and commonly used for this task in Java, at least in my experience.
You'd first need to construct a Document by fetching your page, e.g.:
int localID = 25022; //your player's ID.
Document doc = Jsoup.connect("http://www.chess.org.il/Players/Player.aspx?Id=" + localID).get();
From this Document object you can fetch a lot of information, for example the FIDE ID you asked about. Unfortunately, the web page you linked isn't very simple to scrape, and you'll basically need to go through every link on the page to find the relevant one, for example:
Elements fidelinks = doc.select("a[href*=fide.com]");
This Elements object should give you a list of all links whose href contains the text fide.com, but you probably only want the first one, e.g.:
Element fideurl = doc.selectFirst("a[href*=fide.com]");
From that point on, I don't want to write all the code for you, but hopefully this answer serves as a good starting point!
You can get the ID alone by calling the text() method on your Element object, but you can also get the link itself by calling fideurl.attr("href").
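If you only have the href, the FIDE ID can also be recovered from the link string itself, since the card URL in the question has the form card.phtml?event=<id>. Here is a minimal sketch using only the standard library. The class and method names (FideLinks, extractFideId, cardUrl) are made up for illustration, and it assumes the event parameter is always numeric:

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of jsoup: works on the plain href string.
class FideLinks {
    // Captures the numeric value of the "event" query parameter.
    private static final Pattern EVENT_ID = Pattern.compile("[?&]event=(\\d+)");

    // Returns the FIDE ID found in the href, or empty if there is none.
    static Optional<String> extractFideId(String href) {
        if (href == null) {
            return Optional.empty();
        }
        Matcher m = EVENT_ID.matcher(href);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    // Builds the FIDE card URL (format taken from the question) for the second request.
    static String cardUrl(String fideId) {
        return "http://ratings.fide.com/card.phtml?event=" + fideId;
    }

    public static void main(String[] args) {
        String href = "http://ratings.fide.com/card.phtml?event=2821109";
        // Prints http://ratings.fide.com/card.phtml?event=2821109
        extractFideId(href).ifPresent(id -> System.out.println(cardUrl(id)));
    }
}
```

The string returned by cardUrl can then be passed to Jsoup.connect(...) in the same way as the first request.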
The CSS selector you can use to get the other value is
div#main-col table.contentpaneopen tbody tr td table tbody tr td table tbody tr:nth-of-type(4) td table tbody tr td:first-of-type
which will get you the std score specifically, at least with standard CSS, so this should work with jsoup as well.
Upvotes: 0