Reputation: 13
I have an XML file that is not well formatted, but I need to parse it anyway. I have tried the usual parsing options, such as DOM and SAX, but could not get it to work. The code I tried is shown below.
Could anyone please guide me on how to parse such badly formatted XML data?
Here's the XML file
<?xml version="1.0" ?>
<Employee>
<Name> Jack
<EMPID> EMP001 <Address> 12 CA, USA</Address>
</EMPID>
</Name>
</Employee>
Parsing Code
try {
DocumentBuilderFactory docBuilderFactory = DocumentBuilderFactory
.newInstance();
DocumentBuilder docBuilder = docBuilderFactory.newDocumentBuilder();
Document doc = docBuilder.parse(new File(
"new.xml"));
// normalize text representation
doc.getDocumentElement().normalize();
System.out.println("Root element of the doc is "
+ doc.getDocumentElement().getNodeName());
NodeList listOfPersons = doc.getElementsByTagName("NAME");
int totalPersons = listOfPersons.getLength();
for (int s = 0; s < listOfPersons.getLength(); s++) {
Node firstPersonNode = listOfPersons.item(s);
if (firstPersonNode.getNodeType() == Node.ELEMENT_NODE) {
Element firstPersonElement = (Element) firstPersonNode;
// -------
NodeList firstNameList = firstPersonElement
.getElementsByTagName("Name");
Element firstNameElement = (Element) firstNameList.item(0);
NodeList textFNList = firstNameElement.getChildNodes();
System.out
.println("Name : "
+ ((Node) textFNList.item(0))
.getNodeValue().trim());
// -------
NodeList lastNameList = firstPersonElement
.getElementsByTagName("EMPID");
Element lastNameElement = (Element) lastNameList.item(0);
NodeList textLNList = lastNameElement.getChildNodes();
System.out
.println("ID : "
+ ((Node) textLNList.item(0))
.getNodeValue().trim());
// ----
NodeList ageList = firstPersonElement
.getElementsByTagName("Address");
Element ageElement = (Element) ageList.item(0);
NodeList textAgeList = ageElement.getChildNodes();
System.out.println("Address : "
+ ((Node) textAgeList.item(0)).getNodeValue()
.trim());
}
}
} catch (SAXParseException err) {
System.out.println("** Parsing error" + ", line "
+ err.getLineNumber() + ", uri " + err.getSystemId());
System.out.println(" " + err.getMessage());
} catch (SAXException e) {
Exception x = e.getException();
((x == null) ? e : x).printStackTrace();
} catch (Throwable t) {
t.printStackTrace();
}
Upvotes: 0
Views: 349
Reputation: 247
Try parsing the XML after you have corrected it. XML that carries data is easiest to process when each element holds either a text value or child elements, not both; an element may still have multiple attributes:
<employee attribute="attrvalue">value-string or xml-element, not both</employee>
So a suggestion to how your XML should look would be as follows:
<?xml version="1.0" ?>
<Employee>
<Name> Jack </Name>
<EMPID> EMP001 </EMPID>
<Address> 12 CA, USA</Address>
</Employee>
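With the flattened layout above, reading the values is straightforward. The following is only a minimal sketch, assuming the corrected XML is available as a string; the class name FlatEmployeeDemo and its helper methods are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class FlatEmployeeDemo {
    // Parses the XML string and returns its root element (hypothetical helper).
    public static Element parseRoot(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return doc.getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    // Returns the trimmed text of the first <tag> element below parent.
    public static String text(Element parent, String tag) {
        return parent.getElementsByTagName(tag).item(0).getTextContent().trim();
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" ?>\n"
                + "<Employee>\n"
                + "<Name> Jack </Name>\n"
                + "<EMPID> EMP001 </EMPID>\n"
                + "<Address> 12 CA, USA</Address>\n"
                + "</Employee>";
        Element employee = parseRoot(xml);
        System.out.println("Name : " + text(employee, "Name"));
        System.out.println("ID : " + text(employee, "EMPID"));
        System.out.println("Address : " + text(employee, "Address"));
    }
}
```

Because every element now holds only text, getTextContent() returns exactly the value of that element.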
EDIT: However, if you are receiving the XML from a source that you cannot change, then there is basically only one option left for you: parse the XML manually after converting it to a regular Java String.
Try using the various String methods such as substring, indexOf, etc. Example:
String empidStartElement = "<empid>";
String nameStartElement = "<name>";
String nameEndElement = empidStartElement;
String xml = "<employee><name>Jack<empid>emp001</empid></name></employee>";
// length() is a method on String, and indexOf is case-sensitive
int nameStartPosition = xml.indexOf(nameStartElement) + nameStartElement.length();
int nameEndPosition = xml.indexOf(nameEndElement);
String name = xml.substring(nameStartPosition, nameEndPosition);
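The same idea can be applied to the nested layout from the question. This is a rough sketch, not a robust parser: it assumes each value is the text that directly follows its opening tag, and the names ManualExtractDemo and textAfter are invented for this example.

```java
public class ManualExtractDemo {
    // Returns the trimmed text between startTag and the next '<',
    // or null if startTag does not occur. Matching is case-sensitive.
    public static String textAfter(String xml, String startTag) {
        int start = xml.indexOf(startTag);
        if (start < 0) {
            return null;
        }
        start += startTag.length();
        int end = xml.indexOf('<', start);
        if (end < 0) {
            end = xml.length();
        }
        return xml.substring(start, end).trim();
    }

    public static void main(String[] args) {
        String xml = "<Employee>\n"
                + "<Name> Jack\n"
                + "<EMPID> EMP001 <Address> 12 CA, USA</Address>\n"
                + "</EMPID>\n"
                + "</Name>\n"
                + "</Employee>";
        System.out.println("Name : " + textAfter(xml, "<Name>"));
        System.out.println("ID : " + textAfter(xml, "<EMPID>"));
        System.out.println("Address : " + textAfter(xml, "<Address>"));
    }
}
```

Stopping at the next '<' means the helper does not need to know which tag follows, but it will break if a value itself contains '<' or if tags carry attributes.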
Upvotes: 1
Reputation: 1
Just change the line:
NodeList listOfPersons = doc.getElementsByTagName("NAME");
to:
NodeList listOfPersons = doc.getChildNodes();
output:
Root element of the doc is Employee
Name : Jack
ID : EMP001
Address : 12 CA, USA
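A likely reason the original line found nothing is that XML tag names are case-sensitive, so "NAME" does not match the <Name> element. The small check below illustrates this; it is only a sketch, and the class TagCaseDemo and its count method are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class TagCaseDemo {
    // Counts the elements in the document whose tag name equals 'tag' exactly.
    public static int count(String xml, String tag) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return doc.getElementsByTagName(tag).getLength();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<Employee><Name> Jack<EMPID> EMP001 "
                + "<Address> 12 CA, USA</Address></EMPID></Name></Employee>";
        System.out.println(count(xml, "NAME")); // prints 0
        System.out.println(count(xml, "Name")); // prints 1
    }
}
```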
Upvotes: 0
Reputation: 52185
Since the XML is itself broken, XML parsing will fail.
Assuming that, despite being broken, the XML file will always have that layout, you could use regular expressions to extract the data.
String str = "<?xml version=\"1.0\" ?>\n" +
"<Employee>\n" +
"<Name> Jack\n" +
"<EMPID> EMP001 <Address> 12 CA, USA</Address> \n" +
"</EMPID>\n" +
"</Name>\n" +
"</Employee>";
str = str.replaceAll("\\n", "");
Pattern p = Pattern.compile("<Name>(.+?)<EMPID>(.+?)<Address>(.+?)</Address>");
Matcher m = p.matcher(str);
while(m.find())
{
System.out.println("Name: " + m.group(1) + " EMPID: " + m.group(2) + " Address: " + m.group(3));
}
Yields:
Name: Jack EMPID: EMP001 Address: 12 CA, USA
What does this pattern do:
1. <Name> will match the Name tag.
2. (.+?) will match the text that follows the <Name> tag, but will stop matching the moment it finds <EMPID>, since the ? added after the greedy operator + makes the pattern non-greedy (<EMPID> itself will be matched by the next section of the pattern). Anything matched in this section is placed in a group which can be accessed later.
3. <EMPID> will match the <EMPID> tag.
4. Once the <EMPID> tag has been matched, a process similar to step 2 takes place and the matched content is placed in another group.
5. The same is done for the text between the <Address> and </Address> tags and, once again, anything that matches is placed in a group.
Once the regular expression has matched the string, I access the groups and print their values. As an extra step, I remove any newline characters so the string can be processed as a single line.
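As a variation, the newlines can be left in place by compiling the pattern with Pattern.DOTALL, so that . also matches line terminators, and the surrounding whitespace can be trimmed from each group. This is just a sketch of that alternative; the class RegexExtractDemo and its extract method are names made up for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexExtractDemo {
    // DOTALL lets '.' match newlines, so the input does not need to be flattened first.
    private static final Pattern EMPLOYEE = Pattern.compile(
            "<Name>(.+?)<EMPID>(.+?)<Address>(.+?)</Address>", Pattern.DOTALL);

    // Returns {name, id, address} with whitespace trimmed, or null if nothing matches.
    public static String[] extract(String xml) {
        Matcher m = EMPLOYEE.matcher(xml);
        if (!m.find()) {
            return null;
        }
        return new String[] {
            m.group(1).trim(), m.group(2).trim(), m.group(3).trim()
        };
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" ?>\n"
                + "<Employee>\n"
                + "<Name> Jack\n"
                + "<EMPID> EMP001 <Address> 12 CA, USA</Address>\n"
                + "</EMPID>\n"
                + "</Name>\n"
                + "</Employee>";
        String[] fields = extract(xml);
        System.out.println("Name: " + fields[0]
                + " EMPID: " + fields[1]
                + " Address: " + fields[2]);
    }
}
```

Like the original pattern, this only works as long as the tags always appear in this order and without attributes.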
An introductory tutorial on regular expressions can be found here.
Upvotes: 2
Reputation: 136102
It is not well-formatted, but it is well-formed (http://en.wikipedia.org/wiki/Well-formed_document), so you can parse it with any parser.
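For example, a standard DOM parser accepts the document from the question as it is; the only wrinkle is that Name and EMPID contain both text and a child element, so the value is the element's first text node. The sketch below assumes that layout, and the class MixedContentDemo and its helpers are illustrative names, not part of any library.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

public class MixedContentDemo {
    // Parses the XML string with the default DOM parser.
    public static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    // Returns the text that directly follows the opening tag of the first <tag>
    // element, ignoring any nested elements; empty string if there is none.
    public static String ownText(Document doc, String tag) {
        Element element = (Element) doc.getElementsByTagName(tag).item(0);
        Node first = element.getFirstChild();
        if (first == null || first.getNodeType() != Node.TEXT_NODE) {
            return "";
        }
        return first.getNodeValue().trim();
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" ?>\n"
                + "<Employee>\n"
                + "<Name> Jack\n"
                + "<EMPID> EMP001 <Address> 12 CA, USA</Address>\n"
                + "</EMPID>\n"
                + "</Name>\n"
                + "</Employee>";
        Document doc = parse(xml);
        System.out.println("Name : " + ownText(doc, "Name"));
        System.out.println("ID : " + ownText(doc, "EMPID"));
        System.out.println("Address : " + ownText(doc, "Address"));
    }
}
```

Using getTextContent() here instead would also return the text of the nested elements, which is why the sketch reads only the first child node.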
Upvotes: 1