minirasher

Reputation: 31

Extracting an attribute value in XML using regex

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<!DOCTYPE ... ]> 
<abc-config version="THIS" id="abc">
...
</abc-config>

Hi all,

In the XML above, how can I extract the value of the version attribute of the abc-config element (here, THIS) using a regex in Groovy/Java?

Thanks.

Upvotes: 3

Views: 3435

Answers (3)

tim_yates

Reputation: 171114

I know you asked for a regex, but what's wrong with this in Groovy?

Assuming the XML is something like:

def xml = '''<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<!DOCTYPE abc-config>
<abc-config version="THIS" id="abc">
  <node></node>
</abc-config>'''

Then I can parse it with:

def n = new XmlSlurper().parseText( xml )

And then this line:

println n.@version

Prints out "THIS"
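
Since the question also mentions Java, here is a rough Java counterpart of the same idea, using the JDK's DOM parser instead of XmlSlurper. It is only a sketch: the class and method names (RootAttributeReader, rootAttribute) are made up for illustration, and it assumes the attribute of interest sits on the root element, as in the sample document.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper class; the names are illustrative, not from the answer.
class RootAttributeReader {

    // Parses the XML text and returns the named attribute of the root element.
    // Element.getAttribute returns "" when the attribute is absent.
    static String rootAttribute(String xml, String name) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getAttribute(name);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"yes\"?>\n"
                + "<!DOCTYPE abc-config>\n"
                + "<abc-config version=\"THIS\" id=\"abc\">\n"
                + "  <node></node>\n"
                + "</abc-config>";
        System.out.println(rootAttribute(xml, "version")); // prints THIS
    }
}
```

As with the Groovy version, the parser deals with quoting, whitespace and attribute order, so none of that has to be encoded in a pattern.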


If a more complex DOCTYPE fails to load, you can try turning off loading of the external DTD, either by setting parser features:

def parser = new XmlSlurper()
parser.setFeature( "http://apache.org/xml/features/nonvalidating/load-external-dtd", false )
parser.setFeature( "http://xml.org/sax/features/namespaces", false )
parser.parseText( xml )

or by using the two-argument XmlSlurper constructor, new XmlSlurper( false, false ), which turns off validation and namespace awareness
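
For the Java side of the question, the same two switches can be sketched on a DocumentBuilderFactory. This assumes the JDK's built-in (Xerces-derived) parser, which recognises the load-external-dtd feature URI; other parser implementations may reject it. The class name NoDtdLoading and the DTD path in the sample are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper class; the names are illustrative, not from the answer.
class NoDtdLoading {

    // Reads a root-element attribute without fetching the external DTD
    // named in the DOCTYPE declaration.
    static String rootAttribute(String xml, String name) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(false);
            factory.setNamespaceAware(false);
            // Assumption: the parser supports this Xerces feature (the JDK default does).
            factory.setFeature(
                    "http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getAttribute(name);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // The DTD location below is made up and does not need to exist.
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
                + "<!DOCTYPE abc-config SYSTEM \"file:///no/such/dir/abc-config.dtd\">\n"
                + "<abc-config version=\"THIS\" id=\"abc\"/>";
        System.out.println(rootAttribute(xml, "version"));
    }
}
```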

Upvotes: 2

user557597

Reputation:

Not a Java regex but a Perl one:
/<\w+\s+[^>]*?(?<=\s)version\s*=\s*["'](.+?)["'][^>]*?\s*\/?>/sg

Note that this fails on many levels. I could fill the page with a proper regex, but I don't have the desire.

This fails too:
/<\w+\s+[^>]*?(?<=\s)version\s*=\s*(".+?"|'.+?')[^>]*?\s*\/?>/sg

So does this:
/<\w+\s+[^>]*?(?<=\s)version\s*=\s*(["'])(.+?)\1[^>]*?\s*\/?>/sg
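
A possible Java port of the last pattern, as a sketch only. One of the ways the Perl versions fail on the question's sample is that \w+ cannot match the hyphen in abc-config, so the element name here is widened to [\w:.-]+; that change, the switch from .+? to .*? (to allow empty values) and the names AttributeRegex and findVersion are assumptions added for illustration. The trailing part of the tag is not matched, since only the captured value is needed.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the names are illustrative, not from the answer.
class AttributeRegex {

    // Group 1 is the opening quote, group 2 the attribute value.
    // (?<=\s) keeps names such as data-version from matching.
    private static final Pattern VERSION = Pattern.compile(
            "<[\\w:.-]+\\s+[^>]*?(?<=\\s)version\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.DOTALL);

    // Returns the value of the first version attribute found in a start tag.
    static Optional<String> findVersion(String xml) {
        Matcher m = VERSION.matcher(xml);
        return m.find() ? Optional.of(m.group(2)) : Optional.empty();
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"yes\"?>\n"
                + "<abc-config version=\"THIS\" id=\"abc\">\n"
                + "</abc-config>";
        System.out.println(findVersion(xml).orElse("(none)")); // prints THIS
    }
}
```

The XML declaration is skipped because the character after < must be a name character. The pattern still knows nothing about comments or CDATA sections, so it remains a heuristic rather than a parser.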

Upvotes: 0

CanSpice

Reputation: 35808

A regex to handle this could be something like:

/<\?xml version="([0-9.]+)"/
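
Note that, as written, this pattern captures the version in the XML declaration (1.0), not the version attribute on abc-config that the question asks about. A small Java sketch of the difference follows; the second pattern, anchored on the element name and limited to double-quoted values, is an assumed variant for illustration, as are the class and method names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the names are illustrative, not from the answer.
class DeclarationVsElement {

    // The answer's pattern, ported to Java string syntax.
    static final Pattern DECLARATION =
            Pattern.compile("<\\?xml version=\"([0-9.]+)\"");

    // Assumed variant: anchored on the abc-config start tag, double quotes only.
    static final Pattern ELEMENT =
            Pattern.compile("<abc-config\\s[^>]*?(?<=\\s)version=\"([^\"]*)\"");

    // Returns group 1 of the first match, or null when nothing matches.
    static String firstGroup(Pattern pattern, String input) {
        Matcher m = pattern.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"yes\"?>\n"
                + "<abc-config version=\"THIS\" id=\"abc\">\n"
                + "</abc-config>";
        System.out.println(firstGroup(DECLARATION, xml)); // prints 1.0
        System.out.println(firstGroup(ELEMENT, xml));     // prints THIS
    }
}
```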

I'll spare you one of the 10000 lectures about not using a regex to parse markup languages.

Edit: The One whose Name cannot be expressed in the Basic Multilingual Plane, He compelled me.

Upvotes: 2
