Rocio Batres

Reputation: 1

Trying to grab a string from HTML with Nokogiri

I am a student working on my first CLI project with Ruby, and there is a website I am trying to scrape with Nokogiri. The contents of the website are not strictly organized into their own classes/id, but there is some information that I simply cannot figure out how to scrape. This is what it looks like:

    <p>
      <strong> First Aired:</strong>
      "2017 | "
      <strong> Episodes:</strong>
      " 24"
     <br>

I want to know if there is a way to scrape the string that comes after each "Episodes:" element. The code I tried was:

    require "nokogiri"
    require "open-uri"

    doc = Nokogiri::HTML(URI.open("https://www.techradar.com/best/best-anime"))

    doc.css('p strong')[1].text # that got me "Episodes:"

Then I tried:

    doc.css('p strong')[1].next_element # it skipped the string and gave me "<br>"

I also tried the .children method, but that also returned "Episodes:". I think I am confusing a lot of terms, since these methods have no effect on the string. Is it even possible to grab that string with CSS? Lastly, if it is possible, is there a way to grab only the strings after "Episodes:"?

I appreciate any help. I tried to do some research on Nokogiri and CSS, but I think I am confusing a lot of things.

Upvotes: 0

Views: 120

Answers (1)

Kache

Reputation: 16687

HTML is hierarchical, so for all the nodes you pasted, p is the parent, and the others (including the bare text between the tags) are its children. This is especially apparent if the HTML is properly formatted and indented.

This means that you will find the " 24" under p, like this:

require "nokogiri"

html = <<~STR
    <p>
      <strong> First Aired:</strong>
      "2017 | "
      <strong> Episodes:</strong>
      " 24"
     <br>
STR

html_doc = Nokogiri::HTML.parse(html)

p_element = html_doc.css('p')

p_element.children.map(&:name)
# => ["text", "strong", "text", "strong", "text", "br", "text"]

p_element.children.map(&:to_s)
# => [
#       "\n  ",
#       "<strong> First Aired:</strong>",
#       "\n  \"2017 | \"\n  ",
#       "<strong> Episodes:</strong>",
#       "\n  \" 24\"\n ",       <------------ this is what you wanted
#       "<br>",
#       "\n"
#   ]

p_element.children[4]
# => #(Text "\n  \" 24\"\n ")

If you want the sibling node immediately after the one that has "Episodes:" in it, one way is to do this:

consecutive_pairs = p_element.children.each_cons(2)

_before, after = consecutive_pairs.detect do |before, after|
  before.text.include?("Episodes")
end

after
# => #(Text "\n  \" 24\"\n ")

Upvotes: 1
