mental

Reputation: 1

Jsoup parsing - Java

I'm trying to parse data from IMDB and Rotten Tomatoes using Jsoup. I've gotten most of the info I need but there is some data I don't know how to get.

For example in the movie The Expendables 2 I need to get the number for User Reviews: 313,393 (just the number) but using something like

Elements links19 = doc5.select("p[class=critic_stats]");

returns the text of the whole <p class="critic_stats"> element:

Average Rating: 5.8/10 Reviews Counted: 123 Fresh: 80 | Rotten: 43 Average Rating: 5.3/10 Critic Reviews: 23 Fresh: 13 | Rotten: 10 liked it Average Rating: 3.7/5 User Ratings: 313,393

For the same movie in IMDB I'm trying to get:

Country: USA | Bulgaria
Language: English
Release Date: 17 August 2012 (USA) 
Sound Mix: Dolby Digital | Datasat
Color: Color
Aspect Ratio: 2.35 : 1 

Again I only need the values but everything is in

<div class="txt-block">
<h4 class="inline">

and I don't know if there's any way to get specific data based on for example

<h4 class="inline">Sound Mix:</h4>

<a itemprop='url'>Dolby Digital</a>
<a itemprop='url'>Datasat</a>

Any ideas on how to get those .. I'm not sure what they are, child attributes?

UPDATE:

OK so the first worked with Pattern pattern = Pattern.compile("[0-9]*\\.?,?[0-9]+");

I have a few more questions

I need to get "/title/tt1764651/?ref_=fn_tt_tt_1" or even better the whole http

from

<tr class="findResult odd"> <td class="primary_photo"> <a href="/title/tt1764651/?ref_=fn_tt_tt_1" ><img src="http://ia.media-imdb.com/images/M/MV5BMTQzODkwNDQxNV5BMl5BanBnXkFtZTcwNTQ1ODAxOA@@._V1_SX32_CR0,0,32,44_AL_.jpg" /></a> </td> <td class="result_text"> <a href="/title/tt1764651/?ref_=fn_tt_tt_1" >The Expendables 2</a> (2012) </td></tr></table>

I tried

Elements links = doc.select("table[class=findList"); String a= links.attr("abs:href");

but it doesn't work, any ideas?

Also I used

Document doc6= Jsoup.parse(new URL(url6).openStream(), "ISO-8859-1", url6);

to get the Also Known As but on the Bulgarian title for example I get

Bulgaria (Bulgarian title) ???µ?????±?µ???????????µ


Upvotes: 0

Views: 420

Answers (1)

Spectre

Reputation: 658

jsoup only provides manipulation of HTML. It can get you slightly further than your attempt, but not much:

Element el = doc.select("p.critic_stats").last();

Will get you the element containing:

"Rotten: 10 liked it Average Rating: 3.7/5 User Ratings: 313,393"

This is because there are no further sub-elements to be able to dig into. To extract the data you want will require using other text-processing tools. For example using regular expressions:

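import java.util.regex.Matcher;
import java.util.regex.Pattern;

Pattern pattern = Pattern.compile("User Ratings: ([0-9,]+)");
Matcher matcher = pattern.matcher(el.text());
// find() looks for the pattern anywhere in the text; matches() would require the whole string to match
if (matcher.find()) {
    String userRating = matcher.group(1); // "313,393"
}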
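As a self-contained sketch of the same idea, the class below pulls the count out of a block of text and converts it to a number. The class and method names are made up for illustration, and the sample string is the one quoted above; the real page text may differ.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UserRatingsExtractor {

    // Requires at least one digit, then allows further digits and thousands separators.
    private static final Pattern USER_RATINGS =
            Pattern.compile("User Ratings:\\s*([0-9][0-9,]*)");

    // Returns the user rating count as a number, or -1 if the label is not present.
    static long extractUserRatings(String text) {
        Matcher matcher = USER_RATINGS.matcher(text);
        if (!matcher.find()) {
            return -1;
        }
        // Strip the thousands separators before parsing: "313,393" becomes "313393".
        return Long.parseLong(matcher.group(1).replace(",", ""));
    }

    public static void main(String[] args) {
        String text = "Rotten: 10 liked it Average Rating: 3.7/5 User Ratings: 313,393";
        System.out.println(extractUserRatings(text)); // prints 313393
    }
}
```

Anchoring the pattern on the "User Ratings:" label avoids picking up the other numbers in the same paragraph, which a bare number pattern would match first.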

For your second question, you need to chain selectors:

doc.select("div.txt-block");                   // <div class="txt-block">

doc.select("div.txt-block > h4.inline");       // <h4 class="inline">
doc.select("div.txt-block > a[itemprop=url]"); // <a href="..." itemprop="url">
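To get only the values that belong to one label, such as "Sound Mix:", one option is to walk the div.txt-block elements, check the heading text of each, and collect the link texts from the matching block. The sketch below assumes the markup looks like the fragments in the question (an h4.inline heading followed by a[itemprop=url] links inside the same div); the class name, method name, and sample HTML are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class TxtBlockValues {

    // Returns the link texts from every div.txt-block whose h4.inline heading starts with the label.
    static List<String> valuesFor(String html, String label) {
        Document doc = Jsoup.parse(html);
        List<String> values = new ArrayList<>();
        for (Element block : doc.select("div.txt-block")) {
            Element heading = block.select("h4.inline").first();
            if (heading == null || !heading.text().startsWith(label)) {
                continue; // not the block we are looking for
            }
            for (Element link : block.select("a[itemprop=url]")) {
                values.add(link.text());
            }
        }
        return values;
    }

    public static void main(String[] args) {
        // Hypothetical markup modelled on the snippets in the question.
        String html = "<div class=\"txt-block\"><h4 class=\"inline\">Sound Mix:</h4> "
                + "<a href=\"#\" itemprop=\"url\">Dolby Digital</a> | "
                + "<a href=\"#\" itemprop=\"url\">Datasat</a></div>"
                + "<div class=\"txt-block\"><h4 class=\"inline\">Color:</h4> "
                + "<a href=\"#\" itemprop=\"url\">Color</a></div>";
        System.out.println(valuesFor(html, "Sound Mix")); // prints [Dolby Digital, Datasat]
    }
}
```

Fields whose value is plain text rather than a link would need a different extraction step, for example taking the text of the block and removing the heading text from it.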

jsoup selectors are documented in the Selector class. jsoup's selector syntax is modelled on jQuery's, so the jQuery selector documentation is also useful.
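For the link in the update: the attempt reads href from the table element, which has no such attribute, so it comes back empty. Selecting the a element inside td.result_text and asking for abs:href should give the full URL, provided the document was parsed with a base URI. A minimal sketch, assuming the markup from the question; the class name, method name, and the base URI http://www.imdb.com/ are assumptions for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ResultLink {

    // Returns the absolute URL of the first title link in the results, or "" if there is none.
    static String firstResultUrl(String html, String baseUri) {
        // The base URI lets jsoup resolve relative hrefs when "abs:href" is requested.
        Document doc = Jsoup.parse(html, baseUri);
        Element link = doc.select("td.result_text > a[href]").first();
        return link == null ? "" : link.attr("abs:href");
    }

    public static void main(String[] args) {
        String html = "<table class=\"findList\"><tr class=\"findResult odd\">"
                + "<td class=\"result_text\">"
                + "<a href=\"/title/tt1764651/?ref_=fn_tt_tt_1\">The Expendables 2</a> (2012)"
                + "</td></tr></table>";
        System.out.println(firstResultUrl(html, "http://www.imdb.com/"));
    }
}
```

When the page is fetched with Jsoup.connect(url).get(), the base URI is set from the request URL, so abs:href works without passing it explicitly.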

Upvotes: 2
