Picarus

Reputation: 780

Parsing HTML (not well formed) with Jsoup

I am trying to parse an HTML page using Jsoup and am running into some weird issues. The page is http://www.filmaffinity.com/en/film290741.html and, as you can see, it is not well formed. It has some problems that I guess could affect the parsing. Using Firebug and Chrome I obtained the XPath of the element I am looking for (the 5.8 rating on the page).

I then converted the XPath to a CSS query for Jsoup, so that I could extract that specific element:

Elements rate = doc.select("html body table:nth-child(2) tbody tr td:nth-child(2) table tbody tr td table tbody tr td:nth-child(2) table tbody tr:nth-child(2) td");

Running this code does not land on the element I want. Instead it lands on a different element, which Firebug identifies in XPath as the "wrong" path below:

wrong: /html/body/table[2]/tbody/tr/td[2]/table/tbody/tr/td/table/tbody/tr/td/table/tbody/tr[15]/td[2]
right: /html/body/table[2]/tbody/tr/td[2]/table/tbody/tr/td/table/tbody/tr/td[2]/table/tbody/tr[2]/td

Reading the wrong path from the end, the first difference is:

/td/table/tbody/tr[15]/td[2]

where it takes the first td element and not the second (td instead of td[2]).

Is there any way to fix this kind of issue? Is the problem related to the HTML not being well formed, or am I missing some other Jsoup technique that I could use to work around this?

I chose Jsoup because it is supposed to be able to deal with HTML that is not well formed. Am I being too demanding?

Are there any alternatives to Jsoup that could deal with this kind of problem?

Upvotes: 0

Views: 1164

Answers (2)

Picarus

Reputation: 780

I have not been able to figure out a "scientific" solution. Instead I have searched for other ways to identify the element (based on different attributes and elements).

It is not an elegant solution but it works.

It's excellent that Jsoup supports so many options in its selector syntax. The only drawback is that the supposedly advanced capability to deal with HTML that is not well formed is not so advanced.

Upvotes: 0

millhouse

Reputation: 10007

You were almost there!

The problem is that (as you alluded to) the expression you've supplied to select() matches two elements. I checked this by running the same selector through jQuery in the Chrome dev console.

select() returns an Elements collection, so you could just call rate.get(1), but that doesn't read very well. Instead, you can refine your query a little so that it matches only the rating you're after:

Element rate = doc.select("html body table:nth-child(2) tbody tr td:nth-child(2) table tbody tr td table tbody tr td:nth-child(2) table tbody tr:nth-child(2) td[align=center]").first();

This works because the other td isn't centred.
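
As a side note, one general reason a literal XPath-to-CSS translation can match something other than the XPath target is that the two index differently: XPath table[2] means "the second table among its siblings", while CSS table:nth-child(2) means "a table that is the second child of its parent, whatever the other children are". The thread does not confirm that this is what happens on the page above, so treat the following only as a minimal sketch of the difference. It uses the JDK's built-in XPath engine rather than Jsoup, the markup is a hypothetical well-formed stand-in (not the real page), and the expression *[2][self::table] is used here as an XPath imitation of table:nth-child(2).

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

class IndexSemanticsDemo {

    // Hypothetical stand-in markup: a div followed by two tables.
    // The ids exist only so the matches are easy to tell apart.
    static final String MARKUP =
          "<html><body>"
        + "<div>banner</div>"
        + "<table id='first'><tr><td>a</td></tr></table>"
        + "<table id='second'><tr><td>b</td></tr></table>"
        + "</body></html>";

    // Returns the id attribute of the first node matched by the expression,
    // or an empty string when nothing matches.
    static String idOfMatch(String expression) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            return xpath.evaluate(expression + "/@id",
                                  new InputSource(new StringReader(MARKUP)));
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad expression: " + expression, e);
        }
    }

    public static void main(String[] args) {
        // XPath positional index: second <table> among the <table> siblings.
        System.out.println("table[2]           -> " + idOfMatch("/html/body/table[2]"));
        // nth-child(2) imitation: the second child element, kept only if it is a <table>.
        System.out.println("*[2][self::table]  -> " + idOfMatch("/html/body/*[2][self::table]"));
    }
}
```

With this stand-in markup the first expression selects the table with id "second" and the second selects the table with id "first". In selector terms, the closer counterpart of XPath's table[2] is table:nth-of-type(2), and the closer counterpart of the / step is the child combinator > rather than a space, which matches any descendant. Jsoup's selector syntax supports both, so a stricter translation may be worth trying alongside the td[align=center] refinement.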

Upvotes: 1
