Reputation: 307

Regex works in online checkers but not java

I have an xml file which contains text like this :

<text top="84" left="97" width="737" height="32" font="0">SmartFS-A Serverless Distributed       File System for</text>
<text top="126" left="371" width="187" height="32" font="0">Smartphones</text>
<text top="217" left="253" width="424" height="15" font="1">Sonali Batra,Vijay Raghunathan and Mithun Kumar Rajendran</text>
<text top="237" left="325" width="281" height="13" font="2">School of Computer Science and Engineering</text>

I am trying to extract the first line using a regular expression, since every attribute except font changes from one XML file to the next. The regex I am currently using, which always returns false, is:

if (xml.matches("<text top=\"[0-9]*\" left=\"[0-9]*\" width=\"[0-9]*\" height=\"[0-9]*\" font=\"0\">"))

I have tested the expression in http://gskinner.com/RegExr/ and it detects the line.

Upvotes: 0

Answers (3)

Ian Roberts

Reputation: 122394

If you want to parse XML then you should use an XML parser. Here is an example using the DOM and XPath support built in to Java (imports and exception handling omitted):

DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
dbf.setNamespaceAware(true);
DocumentBuilder builder = dbf.newDocumentBuilder();
// use parse(File) if you have the XML on disk rather than in a String
Document doc = builder.parse(new InputSource(new StringReader(xml)));

XPath xp = XPathFactory.newInstance().newXPath();
NodeList font0Texts = (NodeList)xp.evaluate("//text[@font = '0']", doc,
                                              XPathConstants.NODESET);

Note that for this to work, xml must be well formed; in particular, it must have a single root-level element. The example you give in the question is a document fragment, not a complete document, because it has more than one root-level element. If this is a real complete example, then you'll need something a little more involved to parse it:

DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
dbf.setNamespaceAware(true);
DocumentBuilder builder = dbf.newDocumentBuilder();
Document doc = builder.newDocument();

DocumentFragment fragment = doc.createDocumentFragment();

LSInput input = ((DOMImplementationLS)doc.getImplementation()).createLSInput();
input.setStringData(xml);
LSParser parser = ((DOMImplementationLS)doc.getImplementation()).createLSParser(
     LSParser.MODE_SYNCHRONOUS, null);

parser.parseWithContext(input, fragment, LSParser.ACTION_REPLACE_CHILDREN);

You can then use the fragment to evaluate XPath expressions:

XPath xp = XPathFactory.newInstance().newXPath();
NodeList font0Texts = (NodeList)xp.evaluate("//text[@font = '0']", fragment,
                                              XPathConstants.NODESET);
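
If parseWithContext turns out not to be supported by your parser implementation, one possible alternative is to wrap the fragment in a synthetic root element and parse it as an ordinary document. The following is only a sketch under assumptions: the class name Font0Texts, the method name extract and the wrapper element name root are made up for illustration, and the approach assumes the fragment contains no XML declaration or DOCTYPE.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of the original answer.
class Font0Texts {

    /** Returns the text content of every <text font="0"> element in the fragment. */
    static List<String> extract(String xmlFragment) {
        try {
            // Assumption: no XML declaration or DOCTYPE in the fragment,
            // so surrounding it with a made-up <root> element is safe.
            String wrapped = "<root>" + xmlFragment + "</root>";

            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(wrapped)));

            XPath xp = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xp.evaluate(
                    "//text[@font = '0']", doc, XPathConstants.NODESET);

            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getTextContent());
            }
            return result;
        } catch (Exception e) {
            // Parsing and XPath errors are rethrown unchecked to keep the sketch short.
            throw new IllegalStateException("Could not parse fragment", e);
        }
    }

    public static void main(String[] args) {
        String xml =
              "<text top=\"126\" left=\"371\" width=\"187\" height=\"32\" font=\"0\">Smartphones</text>\n"
            + "<text top=\"237\" left=\"325\" width=\"281\" height=\"13\" font=\"2\">School of Computer Science and Engineering</text>";
        // Only the font="0" element is selected.
        System.out.println(extract(xml));
    }
}
```

The trade-off is an extra string copy of the input, in exchange for using only the plain DocumentBuilder API.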

Upvotes: 1

John B

Reputation: 32969

From what you describe, I suggest using a regex Matcher:

 String regex = "^<text top=\"[0-9]*\" left=\"[0-9]*\" width=\"[0-9]*\" "+
      "height=\"[0-9]*\" font=\"0\">";
 Pattern pattern = Pattern.compile(regex);
 Matcher matcher = pattern.matcher(xml);
 if (matcher.find()){
    ...
 }

This will return true if your xml starts with a text element whose font is 0.

You might also want to use a regex like the following to capture the element's text:

"^<text top=\"[0-9]*\" left=\"[0-9]*\" width=\"[0-9]*\" height=\"[0-9]*\" "+
       "font=\"0\">([^<]*)<"
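
As a rough illustration of how the capturing group could be read back, here is a minimal sketch. The class name FirstLineExtractor and the method name firstFont0Text are hypothetical, chosen only for this example; the regex is the one shown above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original answer.
class FirstLineExtractor {

    // Same attribute pattern as above; group 1 holds the element's text.
    private static final Pattern FONT0 = Pattern.compile(
            "^<text top=\"[0-9]*\" left=\"[0-9]*\" width=\"[0-9]*\" height=\"[0-9]*\" "
            + "font=\"0\">([^<]*)<");

    /** Returns the text of a leading font="0" element, or null if the input does not start with one. */
    static String firstFont0Text(String xml) {
        Matcher matcher = FONT0.matcher(xml);
        // find() looks for the pattern anywhere, but ^ pins it to the start of the input.
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        String xml =
              "<text top=\"126\" left=\"371\" width=\"187\" height=\"32\" font=\"0\">Smartphones</text>\n"
            + "<text top=\"237\" left=\"325\" width=\"281\" height=\"13\" font=\"2\">School of Computer Science and Engineering</text>";
        System.out.println(firstFont0Text(xml)); // prints "Smartphones"
    }
}
```

Because no MULTILINE flag is set, ^ only matches at the very beginning of the string, so later lines are never considered.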

Upvotes: 0

BackSlash

Reputation: 22243

The matches method checks whether the whole string matches the regex.

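Use

xml.matches("(?s).*<text top=\"[0-9]*\" left=\"[0-9]*\" width=\"[0-9]*\" height=\"[0-9]*\" font=\"0\">.*")

(the (?s) flag lets . match line breaks as well, which is needed when xml spans several lines); otherwise your pattern will be evaluated as

^<text top=\"[0-9]*\" left=\"[0-9]*\" width=\"[0-9]*\" height=\"[0-9]*\" font=\"0\">$

which never matches, because the string contains more than that one opening tag.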

Side note: I really recommend using an XML parser for this kind of task.
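
To make the difference concrete, here is a small sketch comparing String.matches with Matcher.find on the same pattern. The class name MatchesVsFind and its method names are made up for this example, and the (?s) flag in the wildcard variant is an addition here so that . also matches line breaks in multi-line input.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original answer.
class MatchesVsFind {

    // The opening-tag pattern from the question.
    static final String TAG =
            "<text top=\"[0-9]*\" left=\"[0-9]*\" width=\"[0-9]*\" height=\"[0-9]*\" font=\"0\">";

    /** True only if the ENTIRE string is exactly one opening tag. */
    static boolean wholeStringMatches(String xml) {
        return xml.matches(TAG);
    }

    /** True if the tag occurs anywhere in the string. */
    static boolean containsTag(String xml) {
        return Pattern.compile(TAG).matcher(xml).find();
    }

    /** matches() with wildcards on both sides; (?s) lets . cross line breaks. */
    static boolean matchesWithWildcards(String xml) {
        return xml.matches("(?s).*" + TAG + ".*");
    }

    public static void main(String[] args) {
        String xml =
              "<text top=\"126\" left=\"371\" width=\"187\" height=\"32\" font=\"0\">Smartphones</text>\n"
            + "<text top=\"237\" left=\"325\" width=\"281\" height=\"13\" font=\"2\">School of Computer Science and Engineering</text>";

        System.out.println(wholeStringMatches(xml));   // false
        System.out.println(containsTag(xml));          // true
        System.out.println(matchesWithWildcards(xml)); // true
    }
}
```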

Upvotes: 3
