ForceMagic

Reputation: 506

Nokogiri is returning to root element. Why?

I'm sorry, this seems like an off-topic question, but just give me 2 minutes. I'm probably missing a tiny detail, but I'm going crazy over this tiny piece of code:

parsed = Nokogiri::HTML(open(url))

fullmeta = parsed.xpath('//*[@id="profile_top"]')[0]
if fullmeta.inner_html.to_s.include? "image"
    meta = fullmeta.xpath('//span[4]')[0]
else
    meta = fullmeta.xpath('//span[3]')[0]
end

puts meta.inner_html                    # This seems fine
puts meta.xpath('//a[1]')[0].inner_html # !!!

The line marked with !!! is the culprit. Something makes that line re-run the XPath from the root element of parsed, even though meta was itself selected from fullmeta a couple of XPaths earlier. What. is. going. on. here? I've been sitting on this code for about an hour (and DuckDuckGo'd half the Internet already).

If you need sample HTML, just use any FanFiction story page. I'm writing an API for that in Rails, but that's not really important here.


Just in case anyone tries it out with FanFiction, this is what I get:

Rated: <a class="xcontrast_txt" href="https://www.fictionratings.com/" target="rating">Fiction  T</a> - English - Humor/Adventure - Chapters: 15   - Words: 55,643 - Reviews: <a href="/r/12135694/">22</a> - Favs: 5 - Follows: 8 - Updated: <span data-xutime="1501553985">17h</span> - Published: <span data-xutime="1473081239">9/5/2016</span> - id: 12135694 
FanFiction

The last line should say Fiction T

Upvotes: 0

Views: 137

Answers (1)

Mark Thomas

Reputation: 37507

Using the full power of XPath often means you don't have to stop and iterate, you can just grab what you want directly with a single expression. This allows you to externalize, store in variables, or otherwise organize your expressions and maintain them more easily, even if the XML changes. With XPath you can even incorporate some logic in the expressions.

Are you trying to get the rating of a story? Note that the rating link has a target="rating" attribute, so you can key off of that rather than counting span elements.

doc.xpath('//*[@id="profile_top"]/span/a[@target="rating"]/text()').to_s

#=> "Fiction M"

Another thing I'd recommend is to use either HTTParty or Mechanize, if you aren't already. They have different strengths. HTTParty gives you an easy way to create a nice object-oriented client with fetching and parsing. Mechanize focuses on scraping but it has Nokogiri built in and you can access the underlying Nokogiri document and just start executing XPath on it.

Edit: Adding a few more from your comment below.

language = doc.xpath('//*[@id="profile_top"]/span[a[@target="rating"]]/text()').to_s.split(' - ')[1]
#=> "English"

Note that the brackets [] can be read as "which contains," so we are looking for the span which contains a link with a target of rating. This way you don't need to count spans, which is more brittle.

genres = doc.xpath('//*[@id="profile_top"]/span[a[@target="rating"]]/text()').to_s.split(' - ')[2].split('/')
#=> ["Humor", "Adventure"]

id = doc.xpath('//*[@id="profile_top"]/span[a[@target="rating"]]/text()').to_s.split(' - ')[5].split(': ').last.strip
#=> "12596791"

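require 'date' # DateTime lives in the date library
# Published is the last data-xutime value; an updated story lists its Updated timestamp first.
published = DateTime.strptime(doc.xpath('//*[@id="profile_top"]//span/@data-xutime').last.value, '%s')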
#=> 2017-08-01T20:03:19+00:00

And so on. I recommend putting the XPaths in something like a hash so you can refer to the more descriptive xpath_for[:rating] instead of hardcoding them all throughout the code.

Upvotes: 1
