bkone

Reputation: 251

How do I count a sub string using a regex in ruby?

I have a very large XML file which I load as a string, so my XML looks like:

<publication ID="7728" contentstatus="Unchanged" idID="0b000064800e9e39">
<volume contentstatus="Unchanged" idID="0b0000648151c35d">
  <article ID="5756261" contentstatus="Changed" doi="10.1109/TNB.2011.2145270" idID="0b0000648151d8ca"/>
</volume>

I want to count the number of occurrences of the string

article ID="5705641" contentstatus="Changed"

How can I convert the ID to a regex?

Here is what I have tried doing

searchstr = 'article ID=\"/[1-9]{7}/\" contentstatus=\"Changed\"'
count = ((xml.scan(searchstr).length)).to_s
puts count

Please let me know how I can achieve this.

Thanks

Upvotes: 1

Views: 943

Answers (4)

the Tin Man

Reputation: 160631

Nokogiri is my recommended Ruby XML parser. It's very robust, and is probably the standard for the language now.

I added two more "articles" to show how easily you can find and manipulate the contents, without having to rely on a regex.

require 'nokogiri'

xml =<<EOT
<publication ID="7728" contentstatus="Unchanged" idID="0b000064800e9e39">
<volume contentstatus="Unchanged" idID="0b0000648151c35d">
  <article ID="5756261" contentstatus="Changed"   doi="10.1109/TNB.2011.2145270" idID="0b0000648151d8ca"/>
  <article ID="5756262" contentstatus="Unchanged" doi="10.1109/TNB.2011.2145270" idID="0b0000648151d8ca"/>
  <article ID="5756263" contentstatus="Changed"   doi="10.1109/TNB.2011.2145270" idID="0b0000648151d8ca"/>
</volume>
EOT

doc = Nokogiri::XML(xml)
puts doc.search('//article[@contentstatus="Changed"]').size.to_s + ' found'

puts doc.search('//article[@contentstatus="Changed"]').map{ |n| "#{ n['ID'] } #{ n['doi'] } #{ n['idID'] }" }

>> 2 found
>> 5756261 10.1109/TNB.2011.2145270 0b0000648151d8ca
>> 5756263 10.1109/TNB.2011.2145270 0b0000648151d8ca

The problem with using regex on HTML or XML is that the patterns break really easily if the XML changes, comes from different sources, or is malformed. Regex was never designed to handle that sort of problem, but a parser was. You could have XML with line ends after every tag, or none at all, and the parser won't care as long as the XML is well-formed. A good parser like Nokogiri can even do fixups if the XML is broken, in order to try to make sense of it.
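To illustrate how fragile a pattern can be, here is a small sketch. The pattern, the constant names, and the helper regex_hits are all made up for this example. It tests one fixed regex against three spellings of the same element, which a parser would treat as equivalent.

```ruby
# Hypothetical pattern that assumes one exact attribute layout.
STRICT_PATTERN = /article ID="\d{7}" contentstatus="Changed"/

# Three ways of writing the same element; an XML parser treats them alike.
VARIANTS = [
  '<article ID="5756261" contentstatus="Changed"/>',  # layout the pattern expects
  "<article ID='5756261' contentstatus='Changed'/>",  # single-quoted attributes
  '<article contentstatus="Changed" ID="5756261"/>'   # attributes in another order
].freeze

# Hypothetical helper: reports, per variant, whether the pattern matched.
def regex_hits(variants, pattern)
  variants.map { |tag| tag.match?(pattern) }
end

p regex_hits(VARIANTS, STRICT_PATTERN)  # prints [true, false, false]
```

Only the first spelling matches, even though all three describe the same article.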

Upvotes: 2

michaeltomer

Reputation: 455

I'm going to go out on a limb and guess that you're new to Ruby. First, it's not necessary to convert count into a string before passing it to puts; puts automatically calls to_s on anything you send to it.

Second, it's rarely a good idea to handle XML with string manipulation. I would strongly advise that you use a full-fledged XML parser such as Nokogiri.

That said, you can't embed a regex in a string like that. The entire query string would need to be a regex.

Something like

/article ID="\d{7}" contentstatus="Changed"/

Quotation marks aren't special characters in a regex, so you don't need to escape them. I also used \d instead of [1-9], because [1-9] excludes 0 and would miss IDs such as 5705641.
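To make the String-versus-Regexp difference concrete, here is a minimal sketch. The sample XML and the helper name count_changed are invented for illustration, and the pattern uses \d so that IDs containing a 0 are counted too. String#scan treats a String argument as literal text, so only a Regexp argument gets pattern matching.

```ruby
# Sample data invented for illustration; the real file would be read from disk.
SAMPLE_XML = <<~XML
  <volume contentstatus="Unchanged">
    <article ID="5756261" contentstatus="Changed" doi="a"/>
    <article ID="5705641" contentstatus="Changed" doi="b"/>
    <article ID="5756262" contentstatus="Unchanged" doi="c"/>
  </volume>
XML

# Hypothetical helper: counts article tags with a 7-digit ID marked "Changed".
def count_changed(xml)
  xml.scan(/article ID="\d{7}" contentstatus="Changed"/).length
end

puts count_changed(SAMPLE_XML)  # prints 2

# A String argument is matched literally, so the regex syntax is never interpreted.
puts SAMPLE_XML.scan('article ID="\d{7}" contentstatus="Changed"').length  # prints 0
```

This still depends on the exact attribute order and quoting in the file, so it is only a stopgap compared with a parser.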

When in doubt about regex in Ruby, I recommend checking out Rubular.com.

And once again, I can't emphasize enough that I really don't condone trying to manipulate XML via regex. Nokogiri will make dealing with XML a billion times easier and more reliable.

Upvotes: 4

Kobi

Reputation: 138147

If XPath is an option, it is a preferred way of selecting XML elements. You can use the selector:

//article[@contentstatus="Changed"]

Or, if possible:

count(//article[@contentstatus="Changed"])
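As a rough sketch of the first selector in Ruby, the example below uses REXML. That is my assumption, chosen to keep the example free of extra gems, and Nokogiri's doc.xpath accepts the same expression. The sample XML and the helper name changed_article_ids are invented for illustration. Note that REXML requires well-formed input, so the sample closes every element.

```ruby
require 'rexml/document'

# Well-formed sample invented for illustration.
XPATH_SAMPLE = <<~XML
  <publication ID="7728">
    <volume contentstatus="Unchanged">
      <article ID="5756261" contentstatus="Changed"/>
      <article ID="5756262" contentstatus="Unchanged"/>
      <article ID="5756263" contentstatus="Changed"/>
    </volume>
  </publication>
XML

# Hypothetical helper: returns the IDs of all articles marked "Changed".
def changed_article_ids(xml)
  doc = REXML::Document.new(xml)
  REXML::XPath.match(doc, '//article[@contentstatus="Changed"]')
              .map { |node| node.attributes['ID'] }
end

ids = changed_article_ids(XPATH_SAMPLE)
puts ids.size  # prints 2
puts ids
```

Counting the matched nodes in Ruby gives the same number that the count(...) expression would return from the XPath engine.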

Upvotes: 2

eykanal

Reputation: 27077

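Your current string is close, but it needs a few changes: remove the errant / from around the numbers, use \d rather than [1-9] so that IDs containing 0 also match, and make the pattern a Regexp literal, because scan matches a String argument literally:

searchstr = /article ID="\d{7}" contentstatus="Changed"/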

Upvotes: 1
