terrorista

Reputation: 227

How can I scrape HTML with Nokogiri without tags?

I need to parse a local HTML file using Nokogiri, but the HTML doesn't have any <div>s with classes. It starts with text.

This is the HTML:

high prices in <a href="Example 1">Example 1</a><br>
low prices in <a href="Example 2">Example 2</a><br>

In this case I just need to get "high" and "low", plus "Example 1" and "Example 2".

How can I get the text when there are no wrapping elements? The tutorials I saw all rely on some <div class= ...> to get the text.

doc.xpath('//a/@href').each do |node|   # get performance indicators
  link = node.text
  @test << Entry2.new(link)
end

@title = doc.xpath('//p').text.scan(/^(high|low)/)

My view:

<% @test.each do |entry| %>
  <p><%= entry.link %></p>
<% end %>

<% @title.each do |f| %>
  <p><%= f %></p>
<% end %>

And the output is like this:

Example 1Example 2

[["high"], ["low"]]

It lists everything at once instead of one entry at a time. How can I change my Nokogiri code so that the output looks like this?

high prices in Example 1
low prices in Example 2

Upvotes: 1

Views: 751

Answers (1)

smathy

Reputation: 27961

Well, Nokogiri will wrap that string in an implicit <html><body><p>..., so all of the text will end up in a single <p>.

So yes, you will be able to get the links in a structured form with:

doc.xpath "//a"

The "high" and "low" strings will be in a single blob of text. You will probably need to pull them out with a regex, which will depend a lot on your requirements and data, but here's the regex for what you're showing and asking for:

doc.xpath('//p').text.scan(/^(high|low)/)
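
One thing to watch for: because the pattern has a capture group, scan returns an array of one-element arrays, which is why the view prints [["high"], ["low"]]. A minimal sketch in plain Ruby, assuming the text of the <p> looks roughly like the string below (the `text` value is a stand-in for illustration, not real Nokogiri output):

```ruby
# Stand-in for doc.xpath('//p').text; the exact whitespace is an assumption.
text = "high prices in Example 1\nlow prices in Example 2\n"

# With a capture group, scan yields one array per match.
nested = text.scan(/^(high|low)/)
p nested # => [["high"], ["low"]]

# With a non-capturing group, scan yields the matched strings directly.
flat = text.scan(/^(?:high|low)/)
p flat # => ["high", "low"]
```

Calling flatten on the nested result would give the same flat array, so either form should let the view render one word per <p>.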

I can't be sure how helpful that will be for your actual requirements, but hopefully it gives you a direction to take.

Upvotes: 3
