rubyist

Reputation: 409

Scraping content from an HTML page

I'm using Nokogiri to scrape web pages. The page consists of an unordered list with multiple list items, each of which contains a link, an image and some text, all wrapped in a div.

I'm trying to find a clean way to extract the elements in each list item so that each li ends up in its own array or hash, like so:

li[0] = ['Acme co 1', 'image1.png', 'Customer 1 details']
li[1] = ['Acme co 2', 'image2.png', 'Customer 2 details'] 

At the moment I get all the elements in one go then store them in separate arrays. Is there a better, more idiomatic way of doing this?

This is my current code:

data = Nokogiri::HTML(html)
images = []
names = []
data.css('ul li img').each { |img| images << img }
data.css('ul li a').each { |a| names << a.text }

This is the HTML I'm working from:

<ul class="customers">
  <li>
    <div>
     <a href='#' class="company-name"> Acme co 1 </a>

      <div class="customer-image">
        <img src="image1.png"/>
      </div>

     <div class=" customer-description">
       Customer 1 details
     </div>
    </div>

   </li>

   <li>
     <div>
       <a href='#' class="company-name"> Acme co 2</a>
        <div class="customer-image">
         <img src="image2.png"/>
        </div>

       <div class=" customer-description">
         Customer 2 details
       </div>
     </div>

   </li>

</ul>

Thanks

Upvotes: 1

Views: 310

Answers (2)

moveson

Reputation: 5213

Assuming the code you have is giving you what you want, I wouldn't try to rewrite anything significant. You can make it more concise and idiomatic by replacing the #each calls with #map:

data = Nokogiri::HTML(html)
images = data.css('ul li img')
names = data.css('ul li a').map(&:text)
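
If the goal is one array per list item, the separate collections could be combined with Array#zip. This is only a minimal sketch: the plain string arrays below are stand-ins for what the css calls would return, and it assumes every li contains exactly one link and one image, since zip pairs elements purely by position.

```ruby
# Hypothetical stand-ins for the values extracted by the two css calls
names  = ['Acme co 1', 'Acme co 2']
images = ['image1.png', 'image2.png']

# zip pairs the elements by index, giving one [name, image] array per item
rows = names.zip(images)

rows.each { |name, image| puts "#{name}: #{image}" }
```

If an li were missing its image, the positions would drift and the pairs would no longer line up, so this only suits pages with a strictly regular structure.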

Upvotes: 2

Tom Lord

Reputation: 28285

data = Nokogiri::HTML(html)
images = data.css('ul li img')
names = data.css('ul li a').map(&:text)

This simplifies your code slightly, but your original version wasn't too bad.

My simplification may not generalise if you are, for example, scraping images from multiple regions of the page. In that case, reverting to something like your original may be fine.

Upvotes: 1
