Reputation: 205

Regular expression for XML not working

I am trying to write a regular expression to match an XML document. I am not using an XML parser right away because the input may contain several XML documents, some well formed and some not, so I would like to remove the malformed ones before parsing.

XML structure:

<company>
    .....
    <Employees>
    .......
    </Employees>
</company>

Code:

    final String xmlString = "...";
    final List<String> data = new ArrayList<String>();
    try
    {
        final Pattern pattern = Pattern.compile("<company>(.+?)</company>", Pattern.DOTALL);
        final Matcher matcher = pattern.matcher(xmlString);
        while (matcher.find())
        {
            final Pattern pattern1 = Pattern.compile("<Employees>(.+?)</Employees>", Pattern.DOTALL);// "+?"
            final Matcher matcher1 = pattern1.matcher(matcher.group(1));
            if (matcher1.find())
            {
                data.add(matcher1.group(1));
            }
        }
    }
    catch (final Exception e)
    {

    }

This works fine when the string contains a single XML document, whether well formed or not. It does not work when a malformed document is followed by a well-formed one:

<company>
    <Employees>

   </Employees>
<company>
    .....
    <Employees>
    .......
    </Employees>
</company>

In this scenario it returns the whole string rather than only the well-formed XML.

Please help. Thanks!

Upvotes: 0

Answers (2)

Michael Kay

Reputation: 163322

You're parsing a language that is similar to XML, but not quite the same.

So the first thing you need to do is to specify the grammar of that language: what constructs is your parser going to accept?

Then you need to write your parser. Almost certainly, the grammar of your language will be recursive, which means it will be beyond the capability of regular expressions to parse it. You may be able to write a parser using tools such as JavaCC.
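To illustrate why nesting needs more than a regular expression, here is a minimal sketch that uses a regex only to find tags and a stack to track nesting. The class name `TagBalanceCheck` and the toy grammar it accepts (bare start and end tags, with no attributes, comments, CDATA, or self-closing tags) are assumptions made for illustration, not a definition of the language in the question.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: checks tag nesting for a toy XML-like grammar.
final class TagBalanceCheck {

    // Matches bare start and end tags such as <company> or </Employees>.
    // Attributes, comments, CDATA and self-closing tags are ignored in this sketch.
    private static final Pattern TAG = Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)>");

    static boolean isBalanced(String text) {
        Deque<String> open = new ArrayDeque<>();
        Matcher m = TAG.matcher(text);
        while (m.find()) {
            String name = m.group(2);
            if (m.group(1).isEmpty()) {
                // Start tag: remember it until its matching end tag shows up.
                open.push(name);
            } else if (open.isEmpty() || !open.pop().equals(name)) {
                // End tag with no open tag, or one that closes the wrong element.
                return false;
            }
        }
        // Anything still on the stack was opened but never closed.
        return open.isEmpty();
    }
}
```

The regex handles only the flat, token-level part of the job; the stack is what remembers how deep the nesting currently is, and that is the part a single regular expression cannot express in general.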

But you need to do some reading. If you're attempting to do this job using regular expressions, this suggests that you aren't aware of the basic computer science behind the problem you are tackling. If you're a smart hacker, you may be able to knock something up that works on most of your input documents, but it will always be at risk of falling over on the next one, unless you understand the theory and apply it.

Upvotes: 0

Jim Garrison

Reputation: 86774

Doing this with a single regular expression is never going to work.

Assuming that the start and end tags appear on separate lines, you need to process the XML one line at a time, keeping track of what you have seen and buffering input until you identify a complete valid subdocument.

Pseudocode:

buffer = ""
while (line = read_input)
{
    tag = trim(line) // assumes each start or end tag sits alone on its line
    if tag=="<company>"
    {
        // discard whatever we have seen, since it didn't end with </company>
        buffer = line
    }
    else if tag=="</company>"
    {
        buffer += line
        write buffer
        buffer = ""
    }
    else
        buffer += line
}

This is the general idea of how to approach the problem... the specifics could be improved (left as an exercise).
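One possible Java rendering of the pseudocode, as a sketch only. It assumes, as stated above, that `<company>` and `</company>` each sit alone on a line. The class name `CompanyBlockExtractor` is hypothetical, and unlike the pseudocode this version also ignores lines that appear before any `<company>` tag.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: a sketch of the line-buffering idea, not an XML validator.
final class CompanyBlockExtractor {

    // Returns each run of lines that starts at a <company> line and ends at the
    // next </company> line, with no other <company> line in between.
    static List<String> extract(String input) {
        List<String> blocks = new ArrayList<>();
        StringBuilder buffer = new StringBuilder();
        boolean inBlock = false;

        for (String line : input.split("\\R")) {
            String tag = line.trim();
            if (tag.equals("<company>")) {
                // New start tag: drop anything buffered, since it never reached </company>.
                buffer.setLength(0);
                buffer.append(line).append('\n');
                inBlock = true;
            } else if (tag.equals("</company>")) {
                if (inBlock) {
                    buffer.append(line).append('\n');
                    blocks.add(buffer.toString());
                }
                buffer.setLength(0);
                inBlock = false;
            } else if (inBlock) {
                buffer.append(line).append('\n');
            }
        }
        return blocks;
    }
}
```

Each returned block is only a candidate document. It can still be malformed inside, so it would normally be handed to a real XML parser, with any block that fails to parse being discarded.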

Upvotes: 2
