Reputation: 37
I'm using Ruby and Nokogiri to parse an HTML source in which items repeat in a recognizable pattern, in the following format:
<span class="x">junk</span>
<small class="y">ParameterA</small>
<span class="z">
<b>ParameterB</b>
<i>Possible ParameterC</i>
</span>
<script type="text/javascript">useless stuff</script>
<object><noscript>other useless stuff</noscript></object>
<span class="x">junk</span>
<small class="y">ParameterA</small>
<span class="z">
<b>ParameterB</b>
<i>Possible ParameterC</i>
</span>
<script type="text/javascript">useless stuff</script>
<object><noscript>other useless stuff</noscript></object>
and so on multiple times.
How can I make a multi-dimensional array with the required parameters in the following structure?
myarray = []
mystuff = Struct.new(:ParameterA, :ParameterB, :ParameterC)
I can't figure out what kind of loop to run here, or how to avoid parsing the useless stuff.
Upvotes: 2
Views: 185
Reputation: 160551
I'd use something like this:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<span class="x">junk</span>
<small class="y">ParameterA</small>
<span class="z">
<b>ParameterB</b>
<i>Possible ParameterC</i>
</span>
<script type="text/javascript">useless stuff</script>
<object><noscript>other useless stuff</noscript></object>
<span class="x">junk</span>
<small class="y">ParameterA</small>
<span class="z">
<b>ParameterB</b>
<i>Possible ParameterC</i>
</span>
<script type="text/javascript">useless stuff</script>
<object><noscript>other useless stuff</noscript></object>
EOT
# Each <small class="y"> starts a record; the <span class="z"> after it holds the rest.
mystuff = doc.search('small.y').map { |small_y|
  span_z = small_y.next_element
  [
    small_y.content,
    span_z.at('b').content,
    span_z.at('i') ? span_z.at('i').content : nil  # <i> is optional
  ]
}
pp mystuff
Which looks like:
[
  [
    "ParameterA",
    "ParameterB",
    "Possible ParameterC"
  ],
  [
    "ParameterA",
    "ParameterB",
    "Possible ParameterC"
  ]
]
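The question asked for Struct instances rather than bare arrays. Here is a minimal sketch of that conversion, assuming rows shaped like the output above. The `Item` name and the snake_case member names are placeholders of mine, not from the question, and the second row is a made-up case with no `<i>` element:

```ruby
# Hypothetical Struct; member names are illustrative only.
Item = Struct.new(:parameter_a, :parameter_b, :parameter_c)

# Rows in the same shape the map block produces (nil when <i> is missing).
rows = [
  ["ParameterA", "ParameterB", "Possible ParameterC"],
  ["ParameterA", "ParameterB", nil]
]

# Splat each three-element row into the Struct constructor.
items = rows.map { |row| Item.new(*row) }

items.each do |item|
  puts "#{item.parameter_a} / #{item.parameter_b} / #{item.parameter_c.inspect}"
end
```

Each element of `items` then answers to `parameter_a`, `parameter_b`, and `parameter_c`, which tends to read better than indexing with `[0]`, `[1]`, `[2]`.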
Upvotes: 0
Reputation: 16202
I was able to solve this with a regexp which gives me the correct multi-dimensional array as output:
[["ParameterA", "ParameterB", "Possible ParameterC"], ["ParameterA", "ParameterB", "Possible ParameterC"]]
Working code:
str = <<EOF
<span class="x">junk</span>
<small class="y">ParameterA</small>
<span class="z">
<b>ParameterB</b>
<i>Possible ParameterC</i>
</span>
<script type="text/javascript">useless stuff</script>
<object><noscript>other useless stuff</noscript></object>
<span class="x">junk</span>
<small class="y">ParameterA</small>
<span class="z">
<b>ParameterB</b>
<i>Possible ParameterC</i>
</span>
<script type="text/javascript">useless stuff</script>
<object><noscript>other useless stuff</noscript></object>
EOF
m = str.scan(/<small [^>]+>([^<]+)<.*?<b>([^<]+)<\/b>\s+<i>([^<]+)<\/i>/m)
puts m.inspect
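One caveat with the pattern above: it requires the `<i>` element, so a record without one gets mispaired with a later record's `<b>` or dropped. Here is a hedged variant, assuming the same markup, that wraps the `<i>` part in an optional group. The sample input is a trimmed, hypothetical version with the second `<i>` removed:

```ruby
# Trimmed, made-up input: the second record has no <i> element.
html = <<~HTML
  <small class="y">A1</small>
  <span class="z">
  <b>B1</b>
  <i>C1</i>
  </span>
  <small class="y">A2</small>
  <span class="z">
  <b>B2</b>
  </span>
HTML

# The (?: ... )? group makes the <i> element optional;
# when it is absent, the third capture comes back as nil.
pattern = /<small [^>]+>([^<]+)<.*?<b>([^<]+)<\/b>(?:\s*<i>([^<]+)<\/i>)?/m

rows = html.scan(pattern)
p rows
# prints [["A1", "B1", "C1"], ["A2", "B2", nil]]
```

Regex matching on HTML stays fragile if attributes or nesting change, so this is probably only reasonable for markup as regular as the sample.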
Upvotes: 1