Reputation: 251
I have a very large XML file which I load as a string, so my XML looks like
<publication ID="7728" contentstatus="Unchanged" idID="0b000064800e9e39">
<volume contentstatus="Unchanged" idID="0b0000648151c35d">
<article ID="5756261" contentstatus="Changed" doi="10.1109/TNB.2011.2145270" idID="0b0000648151d8ca"/>
</volume>
I want to count the number of occurrences of the string
article ID="5705641" contentstatus="Changed"
How can I convert the ID to a regex?
Here is what I have tried:
searchstr = 'article ID=\"/[1-9]{7}/\" contentstatus=\"Changed\"'
count = ((xml.scan(searchstr).length)).to_s
puts count
Please let me know how I can achieve this.
Thanks
Upvotes: 1
Views: 943
Reputation: 160631
Nokogiri is my recommended Ruby XML parser. It's very robust, and is probably the standard for the language now.
I added two more "articles" to show how easily you can find and manipulate the contents, without having to rely on a regex.
require 'nokogiri'
xml =<<EOT
<publication ID="7728" contentstatus="Unchanged" idID="0b000064800e9e39">
<volume contentstatus="Unchanged" idID="0b0000648151c35d">
<article ID="5756261" contentstatus="Changed" doi="10.1109/TNB.2011.2145270" idID="0b0000648151d8ca"/>
<article ID="5756262" contentstatus="Unchanged" doi="10.1109/TNB.2011.2145270" idID="0b0000648151d8ca"/>
<article ID="5756263" contentstatus="Changed" doi="10.1109/TNB.2011.2145270" idID="0b0000648151d8ca"/>
</volume>
EOT
doc = Nokogiri::XML(xml)
puts doc.search('//article[@contentstatus="Changed"]').size.to_s + ' found'
puts doc.search('//article[@contentstatus="Changed"]').map{ |n| "#{ n['ID'] } #{ n['doi'] } #{ n['idID'] }" }
>> 2 found
>> 5756261 10.1109/TNB.2011.2145270 0b0000648151d8ca
>> 5756263 10.1109/TNB.2011.2145270 0b0000648151d8ca
The problem with using regex on HTML or XML is that the patterns break really easily if the XML changes, if it comes from different sources, or if it is malformed. Regex was never designed to handle that sort of problem, but a parser was. You could have XML with line ends after every tag, or none at all, and the parser won't really care as long as the XML is well-formed. A good parser like Nokogiri can even do fixups if the XML is broken, in order to try to make sense of it, but
Upvotes: 2
Reputation: 455
I'm going to go out on a limb and guess that you're new to Ruby. First, it's not necessary to convert count into a string to puts it; puts automatically calls to_s on anything you send to it.
Second, it's rarely a good idea to handle XML with string manipulation. I would strongly advise that you use a full-fledged XML parser such as Nokogiri.
That said, you can't embed a regex in a string like that. The entire query string would need to be a regex.
Something like the following (note [0-9] rather than [1-9], since IDs such as 5705641 contain a 0):
/article ID="[0-9]{7}" contentstatus="Changed"/
Quotation marks aren't special characters in a regex, so you don't need to escape them.
When in doubt about regex in Ruby, I recommend checking out Rubular.com.
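To make the regex approach concrete, here is a minimal sketch of counting matches with String#scan and a Regexp literal. The sample XML, the constant names, and the count_changed helper are made up for illustration; the pattern uses \d{7} so that IDs containing a 0, such as the 5705641 from the question, are matched too.

```ruby
# Hypothetical sample modeled on the question's XML (attributes trimmed).
SAMPLE_XML = <<~XML
  <publication ID="7728" contentstatus="Unchanged">
  <volume contentstatus="Unchanged">
  <article ID="5756261" contentstatus="Changed"/>
  <article ID="5705641" contentstatus="Changed"/>
  <article ID="5756262" contentstatus="Unchanged"/>
  </volume>
  </publication>
XML

# \d{7} matches any seven digits, including zeros.
CHANGED_ARTICLE = /article ID="\d{7}" contentstatus="Changed"/

# scan returns an array of all matches; its length is the count.
def count_changed(xml)
  xml.scan(CHANGED_ARTICLE).length
end

puts count_changed(SAMPLE_XML)
```

With [1-9]{7} in place of \d{7}, the article with ID 5705641 would be skipped because of the 0 in its ID.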
And once again, I can't emphasize enough that I really don't condone trying to manipulate XML via regex. Nokogiri will make dealing with XML a billion times easier and more reliable.
Upvotes: 4
Reputation: 138147
If XPath is an option, it is a preferred way of selecting XML elements. You can use the selector:
//article[@contentstatus="Changed"]
Or, if possible:
count(//article[@contentstatus="Changed"])
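If installing a gem is not an option, the first selector can also be tried with REXML, which ships with standard Ruby installs. This is only a sketch under that assumption, not the Nokogiri approach shown elsewhere in this thread; the sample document and the changed_article_count helper are hypothetical, and the count is taken in Ruby with size rather than with the XPath count() function.

```ruby
require 'rexml/document'

# Hypothetical sample modeled on the question's XML (attributes trimmed).
XML_DOC = <<~XML
  <publication ID="7728" contentstatus="Unchanged">
  <volume contentstatus="Unchanged">
  <article ID="5756261" contentstatus="Changed"/>
  <article ID="5756262" contentstatus="Unchanged"/>
  <article ID="5756263" contentstatus="Changed"/>
  </volume>
  </publication>
XML

# Parse the string and count the nodes selected by the XPath expression.
def changed_article_count(xml)
  doc = REXML::Document.new(xml)
  REXML::XPath.match(doc, '//article[@contentstatus="Changed"]').size
end

puts changed_article_count(XML_DOC)
```

With Nokogiri, the same XPath string can be passed to doc.xpath or doc.search instead.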
Upvotes: 2
Reputation: 27077
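Your current string is close. Remove the errant / from around the numbers, allow 0 in the ID, and turn the string into a regex with Regexp.new before passing it to scan, because scan matches a String argument literally:
searchstr = Regexp.new('article ID=\"[0-9]{7}\" contentstatus=\"Changed\"')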
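One caveat worth checking here: String#scan treats a String argument as literal text, not as a pattern, so a pattern held in a string has to go through Regexp.new first. The sketch below illustrates the difference; the constant and method names are made up for the example, and [0-9] is used so that IDs containing a 0 would match as well.

```ruby
# One line of text shaped like the question's XML attributes.
TEXT = 'article ID="5756261" contentstatus="Changed"'

# Pattern source kept in a single-quoted string, as in the question.
PATTERN_SOURCE = 'article ID=\"[0-9]{7}\" contentstatus=\"Changed\"'

# A String argument makes scan look for that exact substring.
def literal_hits(text, source)
  text.scan(source).length
end

# Regexp.new compiles the string, so [0-9]{7} acts as a pattern.
def regex_hits(text, source)
  text.scan(Regexp.new(source)).length
end

puts literal_hits(TEXT, PATTERN_SOURCE) # literal search finds nothing
puts regex_hits(TEXT, PATTERN_SOURCE)   # compiled pattern finds the article
```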
Upvotes: 1