Misha Sunseev

Reputation: 37

Parse list to multi-dimensional array without a loop

I'm using Ruby and Nokogiri to parse HTML source that lists items in a recognizable, repeating pattern, in the following format:

<span class="x">junk</span>
<small class="y">ParameterA</small>
<span class="z">
    <b>ParameterB</b>
    <i>Possible ParameterC</i>
</span>
<script type="text/javascript">useless stuff</script>
<object><noscript>other useless stuff</noscript></object>

<span class="x">junk</span>
<small class="y">ParameterA</small>
<span class="z">
    <b>ParameterB</b>
    <i>Possible ParameterC</i>
</span>
<script type="text/javascript">useless stuff</script>
<object><noscript>other useless stuff</noscript></object>

and so on, repeated multiple times.

How can I make a multi-dimensional array with the required parameters in the following structure?

myarray = []
mystuff = Struct.new(:ParameterA, :ParameterB, :ParameterC)

I can't figure out what kind of loop to run here, or how to avoid parsing the useless stuff.

Upvotes: 2

Views: 185

Answers (2)

the Tin Man

Reputation: 160551

I'd use something like this:

require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
<span class="x">junk</span>
<small class="y">ParameterA</small>
<span class="z">
    <b>ParameterB</b>
    <i>Possible ParameterC</i>
</span>
<script type="text/javascript">useless stuff</script>
<object><noscript>other useless stuff</noscript></object>

<span class="x">junk</span>
<small class="y">ParameterA</small>
<span class="z">
    <b>ParameterB</b>
    <i>Possible ParameterC</i>
</span>
<script type="text/javascript">useless stuff</script>
<object><noscript>other useless stuff</noscript></object>
EOT

mystuff = doc.search('small.y').map { |small_y|
  span_z = small_y.next_element # the <span class="z"> that follows each <small class="y">
  [
    small_y.content,
    span_z.at('b').content,
    span_z.at('i') ? span_z.at('i').content : nil # nil when there is no <i>
  ]
}

pp mystuff

Which looks like:

[
  [
    "ParameterA",
    "ParameterB",
    "Possible ParameterC"
  ],
  [
    "ParameterA",
    "ParameterB",
    "Possible ParameterC"
  ]
]

Upvotes: 0

Jonah

Reputation: 16202

I was able to solve this with a regexp, which gives the correct multi-dimensional array as output:

[["ParameterA", "ParameterB", "Possible ParameterC"], ["ParameterA", "ParameterB", "Possible ParameterC"]]

Working code:

str = <<EOF
<span class="x">junk</span>
<small class="y">ParameterA</small>
<span class="z">
    <b>ParameterB</b>
    <i>Possible ParameterC</i>
</span>
<script type="text/javascript">useless stuff</script>
<object><noscript>other useless stuff</noscript></object>

<span class="x">junk</span>
<small class="y">ParameterA</small>
<span class="z">
    <b>ParameterB</b>
    <i>Possible ParameterC</i>
</span>
<script type="text/javascript">useless stuff</script>
<object><noscript>other useless stuff</noscript></object>
EOF

# The <i> group is optional, so an item without ParameterC yields nil in the third slot.
m = str.scan(/<small [^>]+>([^<]+)<.*?<b>([^<]+)<\/b>(?:\s*<i>([^<]+)<\/i>)?/m)
puts m.inspect
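Since the question asked for a Struct, the rows that `scan` returns can be splatted straight into one. A rough sketch, assuming a `Row` struct and a cut-down sample that I made up for illustration (the second item deliberately lacks `<i>`); the pattern here treats the `<i>` part as optional so that a missing ParameterC comes back as `nil`:

```ruby
# Hypothetical sample: the second item has no <i> element.
html = <<~HTML
  <small class="y">A1</small>
  <span class="z">
      <b>B1</b>
      <i>C1</i>
  </span>
  <script type="text/javascript">useless stuff</script>
  <small class="y">A2</small>
  <span class="z">
      <b>B2</b>
  </span>
HTML

# Assumed struct; the member names are illustrative.
Row = Struct.new(:parameter_a, :parameter_b, :parameter_c)

# %r{...} avoids escaping the slashes in closing tags.
# The trailing (?:...)? group makes the <i> capture optional.
PATTERN = %r{<small [^>]+>([^<]+)<.*?<b>([^<]+)</b>(?:\s*<i>([^<]+)</i>)?}m

# scan returns one array of captures per match; splat each into a Row.
rows = html.scan(PATTERN).map { |captures| Row.new(*captures) }

rows.each { |row| p row.to_a }
# ["A1", "B1", "C1"]
# ["A2", "B2", nil]
```

Keep in mind that a regexp like this depends on the markup staying exactly in this shape; attribute changes or nested tags inside `<b>` or `<i>` would break it.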

Upvotes: 1
