Reputation: 780
I am trying to parse an HTML page using Jsoup and am running into some odd issues. The page is http://www.filmaffinity.com/en/film290741.html and, as you can see, it is not well formed. It has some problems that I guess could affect the parsing. Using Firebug and Chrome I obtained the XPath to the element I am looking for (the 5.8 rating on the page).
Chrome points to:
/html/body/table[2]/tbody/tr/td[2]/table/tbody/tr[1]/td/table[1]/tbody/tr/td[2]/table/tbody/tr[2]/td
While Firebug points to:
/html/body/table[2]/tbody/tr/td[2]/table/tbody/tr/td/table/tbody/tr/td[2]/table/tbody/tr[2]/td
The only difference is the explicit [1] indices in Chrome's path, which Firebug omits. I have manually verified the path and it is correct.
I then converted the XPath to a CSS query for Jsoup, to extract the specific element later:
Element rate = doc.select("html body table:nth-child(2) tbody tr td:nth-child(2) table tbody tr td table tbody tr td:nth-child(2) table tbody tr:nth-child(2) td").first();
Running this code does not land on the right element. Instead it returns an element that Firebug describes, in XPath, as:
wrong:/html/body/table[2]/tbody/tr/td[2]/table/tbody/tr/td/table/tbody/tr/td/table/tbody/tr[15]/td[2]
right:/html/body/table[2]/tbody/tr/td[2]/table/tbody/tr/td/table/tbody/tr/td[2]/table/tbody/tr[2]/td
Reading the two paths from the end, the first difference is:
/td/table/tbody/tr[15]/td[2]
where it takes the first td and not the second (td[2]).
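One thing that may be worth checking in the conversion: an XPath step such as td[2] means "the second td among the td siblings", whereas the CSS td:nth-child(2) means "a td that is the second child of its parent, counting siblings of any name". The closer CSS counterpart of td[2] is td:nth-of-type(2). Whether this is what goes wrong on this particular page is an assumption. The sketch below only illustrates the difference with the JDK's built-in XPath engine on a made-up, well-formed row (the ROW fragment and the class name are hypothetical, not taken from the page).

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

class XPathIndexDemo {
    // Hypothetical fragment: a <th> precedes the two <td> cells.
    static final String ROW = "<tr><th>h</th><td>first</td><td>second</td></tr>";

    // Evaluates an XPath expression against ROW and returns the
    // string value of the first matching node.
    static String eval(String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(expression, new InputSource(new StringReader(ROW)));
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Second <td> among the td siblings, like CSS td:nth-of-type(2).
        System.out.println(eval("/tr/td[2]"));
        // Second child of any name that is also a <td>, like CSS td:nth-child(2).
        System.out.println(eval("/tr/*[2][self::td]"));
    }
}
```

The two expressions pick different cells as soon as an element with another name comes before them, which is why a mechanical [n] to :nth-child(n) translation can drift on a page with mixed siblings.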
Is there any way to fix this kind of issue? Is the problem related to the HTML not being well formed, or am I missing some other Jsoup technique that I could use to work around this?
I chose Jsoup because it was supposed to be able to deal with malformed HTML. Am I too demanding?
Are there any alternatives to Jsoup that could deal with this kind of problem?
Upvotes: 0
Views: 1164
Reputation: 780
I have not been able to figure out a "scientific" solution. Instead, I searched for other ways to identify the element (based on different attributes and elements).
It is not an elegant solution but it works.
It's excellent that Jsoup supports so many selector options. The only drawback is that its supposedly advanced ability to deal with malformed HTML is not so advanced.
Upvotes: 0
Reputation: 10007
You were almost there!
The problem is that (as you alluded to) the expression you've supplied to select() matches two elements. I checked this by running the same selector through jQuery in the Chrome dev console.
select() returns an Elements collection, so you could just access rate.get(1), but that doesn't really read very well. So instead, you can add a little bit more refinement to your query so that it gets the rating you're after:
Element rate=doc.select("html body table:nth-child(2) tbody tr td:nth-child(2) table tbody tr td table tbody tr td:nth-child(2) table tbody tr:nth-child(2) td[align=center]").first();
Which works because the other td isn't centred.
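A likely reason a selector like this can match more than one element: the spaces in a CSS selector are descendant combinators, while each / in the original XPath is a child step. With nested tables, a descendant match can also reach cells inside inner tables. The sketch below illustrates that with the JDK's XPath engine as a stand-in, on a made-up nested layout (PAGE and the class name are hypothetical). In a CSS selector the child combinator is written >, so using it between the steps is one way to mirror the XPath more closely.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

class DescendantVsChildDemo {
    // Hypothetical nested layout: the first outer cell holds an inner table.
    static final String PAGE =
        "<table><tr>"
      + "<td><table><tr><td>inner-1</td><td>inner-2</td></tr></table></td>"
      + "<td>outer-2</td>"
      + "</tr></table>";

    // Returns how many nodes the given XPath node-set expression selects in PAGE.
    static int count(String nodeSetExpression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            String n = xpath.evaluate("count(" + nodeSetExpression + ")",
                                      new InputSource(new StringReader(PAGE)));
            return (int) Double.parseDouble(n);
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Child steps only, like "table > tr > td:nth-of-type(2)": one match.
        System.out.println(count("/table/tr/td[2]"));
        // Descendant steps, like "table tr td:nth-of-type(2)": the inner cell matches too.
        System.out.println(count("//table//tr//td[2]"));
    }
}
```

Filtering on an attribute, as in the td[align=center] version above, sidesteps the ambiguity without having to spell out every level of the path.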
Upvotes: 1