D133p53

Reputation: 59

Ruby nokogiri selecting multiple elements

Hi everybody, this probably has a simple answer that I am overlooking (I am still learning).

I am trying to scrape data from a website. I am specifically after particular p elements, which are nested inside different elements. These are the selectors for the nested elements:

#ctl00_body_divSearchResult > div:nth-child(5) > div.expandable-box-content.expanded > p:nth-child(2)

#ctl00_body_divSearchResult > div:nth-child(16) > div.expandable-box-content.expanded > p:nth-child(2)

#ctl00_body_divSearchResult > div:nth-child(27) > div.expandable-box-content.expanded > p:nth-child(2)

#ctl00_body_divSearchResult > div:nth-child(38) > div.expandable-box-content.expanded > p:nth-child(2)

#ctl00_body_divSearchResult > div:nth-child(49) > div.expandable-box-content.expanded > p:nth-child(2)

These are five examples from the same page. The first div:nth-child has a different number each time, but the rest is consistent. I am after the individual p:nth-child(2) elements.

Using this code I can get an individual p element:

numbers = agent.get(urlanzctr).css('#ctl00_body_divSearchResult > div:nth-child(5) > div.expandable-box-content > p:nth-child(2)').text

But I think it would be sloppy coding to repeat this for each individual instance.

Upvotes: 0

Views: 395

Answers (1)

max pleaner

Reputation: 26758

Your agent.get(url).css(selector) approach is right: css returns every matching node as a Nokogiri::XML::NodeSet, which behaves much like an array.

Looking at your selectors, they're all of this structure:

#ctl00_body_divSearchResult >
div:nth-child(N) >
div.expandable-box-content.expanded >
p:nth-child(2)

The only variable is the N in div:nth-child.

You have the values 5, 16, 27, 38, 49, which is N = 11x + 5 for x = 0, 1, 2, 3, 4.
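As a quick check of that arithmetic, here is a minimal sketch that generates the N values and the matching div:nth-child fragments. The range 0...5 only covers the five examples from the question; how many results the real page has is unknown.

```ruby
# Generate N = 11x + 5 for the five examples given in the question.
indices = (0...5).map { |x| 11 * x + 5 }

# Build the part of the selector that changes from result to result.
fragments = indices.map { |n| "div:nth-child(#{n})" }

puts indices.inspect
puts fragments.inspect
```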

So you could do something like this:

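def make_selector(n)
  "#ctl00_body_divSearchResult > "         +
  "div:nth-child(#{n}) > "                 +
  "div.expandable-box-content.expanded > " +
  "p:nth-child(2)"
end

# Run the selector against an already fetched page.
def get_matches(page, n)
  page.css(make_selector(n))
end

# Fetch the page once and reuse it; agent and url are the ones from the question.
page = agent.get(url)

idx = 5                          # first N seen in the question
current_matches = get_matches(page, idx)
all_matches = []

until current_matches.empty?
  all_matches.concat(current_matches.to_a)
  idx += 11                      # advance N, otherwise the loop never ends
  current_matches = get_matches(page, idx)
end

puts all_matches.length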

You might also be able to skip the intermediate selectors, i.e. maybe just .expandable-box-content.expanded > p would work; I have no idea what the page structure looks like.

Upvotes: 1
