Reputation: 307
I have an XML file which contains text like this:
<text top="84" left="97" width="737" height="32" font="0">SmartFS-A Serverless Distributed File System for</text>
<text top="126" left="371" width="187" height="32" font="0">Smartphones</text>
<text top="217" left="253" width="424" height="15" font="1">Sonali Batra,Vijay Raghunathan and Mithun Kumar Rajendran</text>
<text top="237" left="325" width="281" height="13" font="2">School of Computer Science and Engineering</text>
I am trying to extract the first line using a regular expression, since every attribute except font changes from one XML file to the next. The regex I am currently using, which always returns false, is:
if (xml.matches("<text top=\"[0-9]*\" left=\"[0-9]*\" width=\"[0-9]*\" height=\"[0-9]*\" font=\"0\">"))
I have tested the expression in http://gskinner.com/RegExr/ and it detects the line.
Upvotes: 0
Views: 104
Reputation: 122394
If you want to parse XML then you should use an XML parser. Here is an example using the DOM and XPath support built in to Java (imports and exception handling omitted):
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
dbf.setNamespaceAware(true);
DocumentBuilder builder = dbf.newDocumentBuilder();
// use parse(File) if you have the XML on disk rather than in a String
Document doc = builder.parse(new InputSource(new StringReader(xml)));
XPath xp = XPathFactory.newInstance().newXPath();
NodeList font0Texts = (NodeList)xp.evaluate("//text[@font = '0']", doc,
XPathConstants.NODESET);
Note that for this to work xml must be well formed, in particular it must have a single root-level element. The example you give in the question is a document fragment, not a complete document, because it has more than one root-level element. If this is a real complete example then you'll need something a little more involved to parse it:
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
dbf.setNamespaceAware(true);
DocumentBuilder builder = dbf.newDocumentBuilder();
Document doc = builder.newDocument();
DocumentFragment fragment = doc.createDocumentFragment();
LSInput input = ((DOMImplementationLS)doc.getImplementation()).createLSInput();
input.setStringData(xml);
LSParser parser = ((DOMImplementationLS)doc.getImplementation()).createLSParser(
LSParser.MODE_SYNCHRONOUS, null);
parser.parseWithContext(input, fragment, LSParser.ACTION_REPLACE_CHILDREN);
You can then use the fragment to evaluate XPath expressions:
XPath xp = XPathFactory.newInstance().newXPath();
NodeList font0Texts = (NodeList)xp.evaluate("//text[@font = '0']", fragment,
XPathConstants.NODESET);
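For anyone who wants to run the first snippet directly, here is a minimal sketch with the imports and exception handling filled in. It assumes the fragment from the question can simply be wrapped in a made-up <page> root element to make it well formed; the class name Font0Extractor, the method name firstFont0Text and the <page> wrapper are illustrative, not part of any API.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class Font0Extractor {

    /** Returns the text of the first <text font="0"> element, or null if there is none. */
    static String firstFont0Text(String fragment) {
        // Assumption: a hypothetical <page> wrapper gives the fragment a single root.
        String xml = "<page>" + fragment + "</page>";
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);
            DocumentBuilder builder = dbf.newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));

            XPath xp = XPathFactory.newInstance().newXPath();
            NodeList hits = (NodeList) xp.evaluate(
                    "//text[@font = '0']", doc, XPathConstants.NODESET);
            return hits.getLength() == 0 ? null : hits.item(0).getTextContent();
        } catch (Exception e) {
            // Parser configuration, malformed XML, or a bad XPath expression.
            throw new IllegalArgumentException("could not parse or query the XML", e);
        }
    }

    public static void main(String[] args) {
        String fragment =
                "<text top=\"84\" left=\"97\" width=\"737\" height=\"32\" font=\"0\">"
              + "SmartFS-A Serverless Distributed File System for</text>"
              + "<text top=\"217\" left=\"253\" width=\"424\" height=\"15\" font=\"1\">"
              + "Sonali Batra,Vijay Raghunathan and Mithun Kumar Rajendran</text>";
        System.out.println(firstFont0Text(fragment));
    }
}
```

The wrapper trick only works if the input has no XML declaration of its own; if it does, the declaration would have to be stripped before wrapping.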
Upvotes: 1
Reputation: 32969
From what you describe, I suggest you use a regex Matcher:
String regex = "^<text top=\"[0-9]*\" left=\"[0-9]*\" width=\"[0-9]*\" "+
"height=\"[0-9]*\" font=\"0\">";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(xml);
if (matcher.find()){
...
}
This will return true if your xml starts with a text element whose font is 0.
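To make the effect of the leading ^ concrete, here is a small sketch. Without flags, ^ only matches at the very start of the input, so find() fails if a different line comes first; compiling with Pattern.MULTILINE lets ^ match at the start of every line. The class and method names are made up for the illustration.

```java
import java.util.regex.Pattern;

final class FindDemo {
    // Same expression as above; the leading ^ anchors the match.
    static final String REGEX =
            "^<text top=\"[0-9]*\" left=\"[0-9]*\" width=\"[0-9]*\" "
          + "height=\"[0-9]*\" font=\"0\">";

    /** True if the font-0 tag is at the very start of the input. */
    static boolean startsWithFont0(String xml) {
        return Pattern.compile(REGEX).matcher(xml).find();
    }

    /** True if the font-0 tag is at the start of any line (MULTILINE changes ^). */
    static boolean anyLineStartsWithFont0(String xml) {
        return Pattern.compile(REGEX, Pattern.MULTILINE).matcher(xml).find();
    }

    public static void main(String[] args) {
        String font0 = "<text top=\"84\" left=\"97\" width=\"737\" height=\"32\" font=\"0\">SmartFS</text>";
        String font1 = "<text top=\"217\" left=\"253\" width=\"424\" height=\"15\" font=\"1\">Sonali Batra</text>";

        System.out.println(startsWithFont0(font0 + "\n" + font1));        // true
        System.out.println(startsWithFont0(font1 + "\n" + font0));        // false
        System.out.println(anyLineStartsWithFont0(font1 + "\n" + font0)); // true
    }
}
```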
You might also want to use a regex as follows to capture the element's text:
"^<text top=\"[0-9]*\" left=\"[0-9]*\" width=\"[0-9]*\" height=\"[0-9]*\" "+
"font=\"0\">([^<]*)<"
Upvotes: 0
Reputation: 22243
The matches method checks whether the whole string matches the regex.
Use
xml.matches(".*<text top=\"[0-9]*\" left=\"[0-9]*\" width=\"[0-9]*\" height=\"[0-9]*\" font=\"0\">.*")
otherwise your pattern will be evaluated as
^<text top=\"[0-9]*\" left=\"[0-9]*\" width=\"[0-9]*\" height=\"[0-9]*\" font=\"0\">$
which never matches, because the string contains more than just that tag.
Side note: I really recommend using an XML parser for this kind of task.
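A short sketch of the difference, with illustrative class and method names. One caveat worth checking: by default . does not match line terminators, so wrapping the pattern in .* is only enough for single-line input. For XML that spans several lines, either use find() or compile the wrapped pattern with Pattern.DOTALL.

```java
import java.util.regex.Pattern;

final class MatchesVsFind {
    static final String TAG =
            "<text top=\"[0-9]*\" left=\"[0-9]*\" width=\"[0-9]*\" "
          + "height=\"[0-9]*\" font=\"0\">";

    /** matches(): the whole input must be exactly the tag. */
    static boolean wholeInputMatches(String xml) {
        return xml.matches(TAG);
    }

    /** find(): the tag may appear anywhere in the input. */
    static boolean containsTag(String xml) {
        return Pattern.compile(TAG).matcher(xml).find();
    }

    /** matches() with .* on both sides; . stops at line terminators. */
    static boolean wrappedMatches(String xml) {
        return xml.matches(".*" + TAG + ".*");
    }

    /** Same, but DOTALL lets . cross line terminators. */
    static boolean wrappedMatchesDotAll(String xml) {
        return Pattern.compile(".*" + TAG + ".*", Pattern.DOTALL).matcher(xml).matches();
    }

    public static void main(String[] args) {
        String font0 = "<text top=\"84\" left=\"97\" width=\"737\" height=\"32\" font=\"0\">SmartFS</text>";
        String font1 = "<text top=\"217\" left=\"253\" width=\"424\" height=\"15\" font=\"1\">Sonali Batra</text>";
        String twoLines = font1 + "\n" + font0;

        System.out.println(wholeInputMatches(font0));       // false
        System.out.println(containsTag(font0));             // true
        System.out.println(wrappedMatches(font0));          // true
        System.out.println(wrappedMatches(twoLines));       // false
        System.out.println(wrappedMatchesDotAll(twoLines)); // true
    }
}
```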
Upvotes: 3