domi91c

Reputation: 2053

Parsing multiple lists in HTML file with Nokogiri

I'm trying to learn scripting with Ruby, and this is my first problem.

I have an HTML file which contains states and their cities. I need to be able to access the cities and know which state each one belongs to in my Ruby code, so I plan on parsing the HTML and creating a hash for each city, like this: {"New York" => "New York City"}.

I'm attempting to use Nokogiri, which I'm just learning now.

  <h4>State</h4>
  <ul>
    <li>city</li>
    <li>city</li>
    <li>city</li>
  </ul>
  <h4>State</h4>
  <ul>
    <li>city</li>
    <li>city</li>
    <li>city</li>
  </ul>
  <h4>State</h4>
  <ul>
    <li>city</li>
    <li>city</li>
    <li>city</li>
  </ul>

I'm using this to get the states into an array:

require 'rubygems'
require 'nokogiri'

page = Nokogiri::HTML(open("to_parse.html"))

states = Array.new(100)
index = 0

page.css('h4').each do |s|

    states[index]   = s.text
    puts states[index]

    index += 1
end

This actually doesn't really help; I need to figure out how I can get Nokogiri to parse the elements of each list into hashes containing the city and its state. I'm not sure how to have a loop break when it finishes the city list of one state, and create a new set of hashes for the city list of the next state.

I'm thinking I'll have to create a hash for each list element and store the text of the h4 tag for that list inside each hash, so I know which state the city belongs to. That is the part I'm not sure how to do.

Feel free to offer some advice on refactoring what I've got, as I know it could be done better.

Upvotes: 0

Views: 628

Answers (1)

Mark Thomas

Reputation: 37527

XPath selectors can help you out here.

# doc is the parsed Nokogiri document (called page in the question)
states = doc.css('li').map do |city|
  state = city.xpath('../preceding-sibling::h4[1]')
  [city.text, state.text]
end.to_h

#=> {'city' => 'State', ...}

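This grabs all the li city elements, then traces each one back to its state. The XPath reads like so: .. = up one level, to the enclosing ul; preceding-sibling::h4 = the h4 elements that come before that ul at the same level; [1] = the nearest of them, because positions on a reverse axis such as preceding-sibling are counted backward from the current node.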

Some comments on your code: In Ruby, you don't need to pre-size arrays (they grow as needed), and with Enumerable methods like map you rarely need to track index variables in loops.

Note that the final to_h only works in Ruby 2.1 or greater.
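On an older Ruby, one possible fallback is Hash[], which also accepts an array of [key, value] pairs. A small sketch, with a hand-written pairs array standing in for the result of the map block:

```ruby
# Stand-in for the array of [city, state] pairs produced by map;
# the names are made up for illustration.
pairs = [['Buffalo', 'New York'], ['Los Angeles', 'California']]

# Hash[] builds a hash from an array of two-element arrays.
city_to_state = Hash[pairs]

puts city_to_state['Buffalo']
```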

Upvotes: 1
