user1432151

Reputation: 13

Parsing XML in Java

I have an XML file that is not well formatted, but I need to parse it anyhow. I have tried the usual parsing options, such as DOM and SAX parsing, but still could not get it to work. The code I tried is shown below.

Could anyone please guide me on how to parse XML data that is not well formatted like this?

Here's the XML file

<?xml version="1.0" ?>
<Employee>
<Name> Jack
<EMPID> EMP001 <Address> 12 CA, USA</Address> 
</EMPID>
</Name>
</Employee>

Parsing Code

        try {
            DocumentBuilderFactory docBuilderFactory = DocumentBuilderFactory
                    .newInstance();
            DocumentBuilder docBuilder = docBuilderFactory.newDocumentBuilder();
            Document doc = docBuilder.parse(new File(
                    "new.xml"));

            // normalize text representation
            doc.getDocumentElement().normalize();
            System.out.println("Root element of the doc is "
                    + doc.getDocumentElement().getNodeName());

            NodeList listOfPersons = doc.getElementsByTagName("NAME");
            int totalPersons = listOfPersons.getLength();


            for (int s = 0; s < listOfPersons.getLength(); s++) {

                Node firstPersonNode = listOfPersons.item(s);
                if (firstPersonNode.getNodeType() == Node.ELEMENT_NODE) {

                    Element firstPersonElement = (Element) firstPersonNode;

                    // -------
                    NodeList firstNameList = firstPersonElement
                            .getElementsByTagName("Name");
                    Element firstNameElement = (Element) firstNameList.item(0);

                    NodeList textFNList = firstNameElement.getChildNodes();
                    System.out
                            .println("Name : "
                                    + ((Node) textFNList.item(0))
                                            .getNodeValue().trim());

                    // -------
                    NodeList lastNameList = firstPersonElement
                            .getElementsByTagName("EMPID");
                    Element lastNameElement = (Element) lastNameList.item(0);

                    NodeList textLNList = lastNameElement.getChildNodes();
                    System.out
                            .println("ID : "
                                    + ((Node) textLNList.item(0))
                                            .getNodeValue().trim());

                    // ----
                    NodeList ageList = firstPersonElement
                            .getElementsByTagName("Address");
                    Element ageElement = (Element) ageList.item(0);

                    NodeList textAgeList = ageElement.getChildNodes();
                    System.out.println("Address : "
                            + ((Node) textAgeList.item(0)).getNodeValue()
                                    .trim());



                }

            }

        } catch (SAXParseException err) {
            System.out.println("** Parsing error" + ", line "
                    + err.getLineNumber() + ", uri " + err.getSystemId());
            System.out.println(" " + err.getMessage());

        } catch (SAXException e) {
            Exception x = e.getException();
            ((x == null) ? e : x).printStackTrace();

        } catch (Throwable t) {
            t.printStackTrace();
        }

Upvotes: 0

Views: 349

Answers (4)

Nadrendion

Reputation: 247

Try parsing the XML after you have corrected it. Well-formatted XML has only one value per element, though an element may have multiple attributes:

<employee attribute="attrvalue">value-string or xml-element, not both</employee>

A suggestion for how your XML should look:

<?xml version="1.0" ?>
<Employee>
    <Name> Jack </Name>
    <EMPID> EMP001 </EMPID>
    <Address> 12 CA, USA</Address> 
</Employee>
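
If the file can be corrected to the flat layout above, reading it with DOM becomes straightforward. The following is only a minimal sketch under that assumption; the class name FlatEmployeeParser and the helper textOf are made up for illustration, and the XML is held in a string rather than read from new.xml.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class FlatEmployeeParser {

    // Hypothetical helper: trimmed text of the first descendant with the given tag name.
    static String textOf(Element parent, String tag) {
        return parent.getElementsByTagName(tag).item(0).getTextContent().trim();
    }

    // Parses the flat layout and returns {name, id, address}.
    static String[] parse(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        Element employee = doc.getDocumentElement();
        return new String[] {
            textOf(employee, "Name"),
            textOf(employee, "EMPID"),
            textOf(employee, "Address")
        };
    }

    public static void main(String[] args) throws Exception {
        String xml = "<?xml version=\"1.0\" ?>\n"
                + "<Employee>\n"
                + "    <Name> Jack </Name>\n"
                + "    <EMPID> EMP001 </EMPID>\n"
                + "    <Address> 12 CA, USA</Address>\n"
                + "</Employee>";
        String[] fields = parse(xml);
        System.out.println("Name : " + fields[0]);
        System.out.println("ID : " + fields[1]);
        System.out.println("Address : " + fields[2]);
    }
}
```

Note that tag names are case-sensitive, so the lookup uses "Name" and not "NAME".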

EDIT: However, if you are receiving the XML from a source that you cannot change, then there is basically only one option left: parse the XML manually after converting it to a regular Java String.

Try using String methods such as substring and indexOf. Example:

String empidStartElement = "<empid>";
String nameStartElement = "<name>";
// The name text ends where the nested <empid> element begins.
String nameEndElement = empidStartElement;

String xml = "<employee><name>Jack<empid>emp001</empid></name></employee>";

// length() is a method on String, not a field.
int nameStartPosition = xml.indexOf(nameStartElement) + nameStartElement.length();
int nameEndPosition = xml.indexOf(nameEndElement);

String name = xml.substring(nameStartPosition, nameEndPosition); // "Jack"

Upvotes: 1

lookassh

Reputation: 1

Just change the line:

NodeList listOfPersons = doc.getElementsByTagName("NAME");

to:

NodeList listOfPersons = doc.getChildNodes();

output:

Root element of the doc is Employee

Name : Jack

ID : EMP001

Address : 12 CA, USA

Upvotes: 0

npinti

Reputation: 52185

Since the XML is itself broken, XML parsing will fail.

Assuming that, despite being broken, the XML file will always have that layout, you could use regular expressions to extract the data.

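// Requires java.util.regex.Pattern and java.util.regex.Matcher.
String str = "<?xml version=\"1.0\" ?>\n" +
        "<Employee>\n" +
        "<Name> Jack\n" +
        "<EMPID> EMP001 <Address> 12 CA, USA</Address> \n" +
        "</EMPID>\n" +
        "</Name>\n" +
        "</Employee>";
// Remove line breaks so that . can match across the original lines.
str = str.replaceAll("\\n", "");
Pattern p = Pattern.compile("<Name>(.+?)<EMPID>(.+?)<Address>(.+?)</Address>");
Matcher m = p.matcher(str);
while (m.find()) {
    // trim() drops the whitespace that surrounds each captured value.
    System.out.println("Name: " + m.group(1).trim() + " EMPID: " + m.group(2).trim()
            + " Address: " + m.group(3).trim());
}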

Yields:

Name: Jack EMPID: EMP001 Address: 12 CA, USA

What does this pattern do:

  • <Name> will match the Name tag.
  • (.+?) will match the text that follows the <Name> tag, but will stop matching the moment it finds <EMPID> (which is matched by the next section of the pattern), because the ? added after the greedy operator + makes it non-greedy. Anything matched in this section is placed in a group that can be accessed later.
  • Once the name is extracted, the engine will attempt to match the <EMPID> tag.
  • After the <EMPID> tag has been matched, a process similar to step 2 takes place and the matched content is placed in another group.
  • Next, the pattern looks for the <Address> tag.
  • Lastly, the regex extracts any characters between the <Address> and </Address> tags and, once again, places whatever matches in a group.

Once the regular expression has matched the string, I access the groups and print their values. As an extra step, I remove any newline characters beforehand so that the string is processed as a single line.
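
To see why the non-greedy ? matters, here is a small sketch that compares (.+?) with (.+). The two-record input string, the class name LazyVsGreedy and the helper firstGroup are invented for illustration and are not part of the question's data.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LazyVsGreedy {

    // Hypothetical helper: group 1 of the first match, or null if nothing matches.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Made-up input with two records on a single line.
        String input = "<Name>Jack<EMPID>EMP001</EMPID></Name>"
                + "<Name>Jill<EMPID>EMP002</EMPID></Name>";

        // Non-greedy: the group ends at the first <EMPID> that follows <Name>.
        System.out.println(firstGroup("<Name>(.+?)<EMPID>", input));

        // Greedy: the group runs on to the last <EMPID> in the input.
        System.out.println(firstGroup("<Name>(.+)<EMPID>", input));
    }
}
```

The first call captures only Jack, while the greedy variant swallows everything up to the second record's <EMPID> tag, which is why the greedy form is unsuitable once the input contains more than one employee.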

An introductory tutorial on regular expressions can be found here.

Upvotes: 2

Evgeniy Dorofeev

Reputation: 136102

It is not well-formatted, but it is well-formed (see http://en.wikipedia.org/wiki/Well-formed_document), so you can parse it with any parser.
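
As a rough illustration of that point, the sketch below feeds the question's XML to the standard DOM parser unchanged. Each element mixes text with a nested element, so the code collects only the text nodes that are direct children. The class name MixedContentParser and the helper ownText are assumptions made for this example, and the XML is inlined as a string instead of being read from new.xml.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

public class MixedContentParser {

    // Hypothetical helper: trimmed text that sits directly inside the element,
    // skipping text that belongs to nested child elements.
    static String ownText(Element element) {
        StringBuilder text = new StringBuilder();
        for (Node child = element.getFirstChild(); child != null;
                child = child.getNextSibling()) {
            if (child.getNodeType() == Node.TEXT_NODE) {
                text.append(child.getNodeValue());
            }
        }
        return text.toString().trim();
    }

    // Follows the nesting Employee > Name > EMPID > Address; returns {name, id, address}.
    static String[] parse(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        Element root = doc.getDocumentElement();
        Element name = (Element) root.getElementsByTagName("Name").item(0);
        Element empId = (Element) name.getElementsByTagName("EMPID").item(0);
        Element address = (Element) empId.getElementsByTagName("Address").item(0);
        return new String[] { ownText(name), ownText(empId), ownText(address) };
    }

    public static void main(String[] args) throws Exception {
        String xml = "<?xml version=\"1.0\" ?>\n"
                + "<Employee>\n"
                + "<Name> Jack\n"
                + "<EMPID> EMP001 <Address> 12 CA, USA</Address> \n"
                + "</EMPID>\n"
                + "</Name>\n"
                + "</Employee>";
        String[] fields = parse(xml);
        System.out.println("Name : " + fields[0]);
        System.out.println("ID : " + fields[1]);
        System.out.println("Address : " + fields[2]);
    }
}
```

Using getTextContent() on the Name element would instead return the text of all of its descendants, which is why the helper walks the direct children only.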

Upvotes: 1
