Erik Mingo

Reputation: 55

In Java, trying to extract XMLNS using a regular expression

I have been trying for a few hours to get this right, and I really can't seem to do it...

Given a string

"xmlns:oai-identifier=\"http://www.openarchives.org/OAI/2.0/oai-identifier\""

what is the correct expression to "save" the http://www.openarchives.org/OAI/2.0/oai-identifier bit?

Thanks in advance, really having trouble getting this right.

String validXML = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><feed "
            + "xmlns:oai-identifier=\"http://www.openarchives.org/OAI/2.0/oai-identifier\" "
            + "xmlns:mingo-identifier=\"http://www.google.com\" "
            + "xmlns:abeve-identifier=\"http://www.news.ycombinator.org/OAI/2.0/oai-identifier\">"
            + "</feed>";

    Pattern p = Pattern.compile(".*\\\"(.*)\\\".*");
    Matcher m = p.matcher(validXML);
    System.out.println(m.group(1));

This is not printing anything. Be aware that this attempt was just to get the string inside the quotes; I was going to worry about the other part once I got that working... Too bad I never got that working. Thanks

Upvotes: 2

Views: 935

Answers (3)

helderdarocha

Reputation: 23627

Since you are reading XML, you might be using DOM, so you can extract the namespace from the prefix name using lookupNamespaceURI() once you parse the document with the setNamespaceAware() option set to true:

DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setNamespaceAware(true);
Document doc = factory.newDocumentBuilder().parse(new InputSource(new StringReader(validXML)));

String namespace = doc.lookupNamespaceURI("oai-identifier");

It's simpler and you don't have to do any string parsing.
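
If you need every declared prefix rather than one known prefix, a rough sketch along the same lines is to walk the attributes of the root element after parsing. The class name NamespaceDomSketch and the method rootNamespaces are made up for illustration, and the sketch assumes the declarations of interest sit on the root element, as in the question's sample.

```java
import java.io.StringReader;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of the original answer.
public class NamespaceDomSketch {

    // Returns prefix -> URI for every xmlns:prefix declaration on the root element.
    static Map<String, String> rootNamespaces(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            NamedNodeMap attrs = doc.getDocumentElement().getAttributes();
            // TreeMap only so that printing order is stable; attribute order is not guaranteed.
            Map<String, String> namespaces = new TreeMap<>();
            for (int i = 0; i < attrs.getLength(); i++) {
                Attr attr = (Attr) attrs.item(i);
                String name = attr.getName();
                if (name.startsWith("xmlns:")) {
                    namespaces.put(name.substring("xmlns:".length()), attr.getValue());
                }
            }
            return namespaces;
        } catch (Exception e) {
            // Wrap parser and I/O failures so callers need no checked exceptions.
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String validXML = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><feed "
                + "xmlns:oai-identifier=\"http://www.openarchives.org/OAI/2.0/oai-identifier\" "
                + "xmlns:mingo-identifier=\"http://www.google.com\">"
                + "</feed>";
        rootNamespaces(validXML).forEach((prefix, uri) -> System.out.println(prefix + " -> " + uri));
    }
}
```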

Upvotes: 2

tmanion

Reputation: 401

Regular expressions can be expensive, so don't use them when you don't need to! There are plenty of other ways to parse a string, for example with indexOf() and substring():

String validXml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><feed "
        + "xmlns:oai-identifier=\"http://www.openarchives.org/OAI/2.0/oai-identifier\" "
        + "xmlns:mingo-identifier=\"http://www.google.com\" "
        + "xmlns:abeve-identifier=\"http://www.news.ycombinator.org/OAI/2.0/oai-identifier\">"
        + "</feed>";
String start = "xmlns:oai-identifier=\"";
String end = "\""; // closing quote only, so this also works for the last attribute before '>'
int location = validXml.indexOf(start);
String result;
if (location >= 0) { // indexOf returns -1 when nothing is found; 0 is a valid position
    result = validXml.substring(location + start.length());
    int endIndex = result.indexOf(end);
    if (endIndex >= 0) {
        result = result.substring(0, endIndex);
    }
    else {
        throw new Exception("Could not find end!");
    }
}
else {
    throw new Exception("Could not find start!");
}
System.out.println(result);

Upvotes: 2

ATG

Reputation: 1707

I think the problem might be that the first .* in your regular expression is too greedy and matches more characters than you'd like.

Try changing ".*\\\"(.*)\\\".*" to be "xmlns.*=\"(.*)\".*" and see whether that works.

If it doesn't work at first, you can also try reinstating the quote escaping. Off the top of my head, I don't think the quotes need escaping in the regex itself, but I'm not 100% sure.

Note also that this will only match a single namespace declaration, not each one in the validXML variable in your example. You'll have to split the string in order to use this on an arbitrary number of xmlns:.*= attributes.
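
As an alternative to splitting the string, here is a rough sketch that loops with Matcher.find() and uses a negated character class ([^"]*) instead of .* so each match stops at the closing quote. Note that Matcher.group() throws an IllegalStateException unless find() or matches() has succeeded first, which the code in the question never calls. The class name XmlnsRegexSketch and the method extractNamespaces are made up for illustration, and the pattern assumes double-quoted attribute values with no whitespace around the equals sign.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original answer.
public class XmlnsRegexSketch {

    // Group 1 is the prefix after "xmlns:", group 2 is the quoted URI.
    // The negated class for group 2 cannot run past the closing quote,
    // so a match never spills into the next attribute.
    private static final Pattern XMLNS =
            Pattern.compile("xmlns:([\\w.-]+)=\"([^\"]*)\"");

    // Returns prefix -> URI for every xmlns:prefix="..." found in the text.
    static Map<String, String> extractNamespaces(String xml) {
        Map<String, String> namespaces = new LinkedHashMap<>();
        Matcher m = XMLNS.matcher(xml);
        // group() is only valid after a successful find() or matches().
        while (m.find()) {
            namespaces.put(m.group(1), m.group(2));
        }
        return namespaces;
    }

    public static void main(String[] args) {
        String validXML = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><feed "
                + "xmlns:oai-identifier=\"http://www.openarchives.org/OAI/2.0/oai-identifier\" "
                + "xmlns:mingo-identifier=\"http://www.google.com\" "
                + "xmlns:abeve-identifier=\"http://www.news.ycombinator.org/OAI/2.0/oai-identifier\">"
                + "</feed>";
        extractNamespaces(validXML)
                .forEach((prefix, uri) -> System.out.println(prefix + " -> " + uri));
    }
}
```

This is still plain text matching, so it will also pick up xmlns: declarations inside comments or CDATA sections; a real XML parser avoids that.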

Upvotes: 2
