Reputation: 7120

Java regular expression to extract content between tags

Input :

<tag>Testing different formatting options in </tag><tag class="classA classB">Text</tag><tag class="classC">Class C text</tag>

Expected Output :

<tag>Testing different formatting options in </tag><tagA><tagB>Text</tagA></tagB><tagC>Class C text</tag>

Basically, the tag is replaced by tags derived from the values of its "class" attribute. That is, if the attribute contains classA, the tag is replaced by tagA; if classB is also present, the output also includes tagB, and so on.

Attempt made :

    final String TAG_GROUPS = "<tag class=\"(.*)\">(.*)</tag>";
    Pattern pattern = Pattern.compile(TAG_GROUPS);
    Matcher matcher = pattern.matcher(inputString);

The pattern does not match the individual tags. In particular, the statement

    String classes = matcher.group(1);

gives the string classA classB">Text</tag><tag class="classC">Class C text</tag. I am new to regular expressions and would like to know the right pattern for this problem. Any help is appreciated.

Upvotes: 0

Answers (3)

PhoneixS

Reputation: 11036

The * quantifier is greedy: it consumes as many characters as possible and gives them back only as far as needed for the rest of the pattern to match.

If you want .* to match as few characters as possible, use the lazy (reluctant) form, *?.

So your regex becomes:

    <tag class=\"(.*?)\">(.*?)</tag>
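As a minimal sketch of the difference, the following runs the greedy pattern from the question and the lazy pattern above against the question's input. The class name `ReluctantTagDemo` and the helper `findAll` are made up for illustration and are not part of the original post.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReluctantTagDemo {
    // Greedy pattern from the question and the lazy variant from this answer.
    static final Pattern GREEDY = Pattern.compile("<tag class=\"(.*)\">(.*)</tag>");
    static final Pattern RELUCTANT = Pattern.compile("<tag class=\"(.*?)\">(.*?)</tag>");

    // Hypothetical helper: collects "classes|content" for every match of the pattern.
    static List<String> findAll(Pattern pattern, String input) {
        List<String> results = new ArrayList<>();
        Matcher matcher = pattern.matcher(input);
        while (matcher.find()) {
            results.add(matcher.group(1) + "|" + matcher.group(2));
        }
        return results;
    }

    public static void main(String[] args) {
        String input = "<tag>Testing different formatting options in </tag>"
                + "<tag class=\"classA classB\">Text</tag>"
                + "<tag class=\"classC\">Class C text</tag>";
        // Greedy: a single match whose first group runs across both tags.
        System.out.println(findAll(GREEDY, input));
        // Lazy: one match per tag that has a class attribute.
        System.out.println(findAll(RELUCTANT, input));
    }
}
```

With the lazy pattern, each tag that carries a class attribute is matched separately, so group 1 holds only that tag's class list.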

The lazy version is the easy way, but not necessarily the most efficient one. A lazy quantifier can be slower than a greedy one, because the engine has to try the rest of the pattern after every character it consumes, so avoid it where you can. For example, if you can assume the input is well formed (no tag left without its closing tag, and so on), it is better to use negated character classes instead of .*?. Your regex can then be written as:

    <tag class="([^"]*)">([^<]*)</tag>

This is more efficient for the regex engine (although it is not always possible to convert a lazy match into a negated character class).

And of course, if you are processing a complete HTML or XML document in which you need to make many different changes, it is better to use an XML (HTML) parser.
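The negated-class pattern can also drive the replacement the question asks for. The following is only a sketch under stated assumptions: the class name `NegatedClassTagDemo`, the helper `toTagName`, and the rule that "classA" maps to "tagA" are illustrative and not from the original post. Closing tags are emitted in reverse order so that the result is well nested, which differs slightly from the expected output shown in the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NegatedClassTagDemo {
    // [^"]* cannot run past the closing quote; [^<]* cannot run past the next tag.
    static final Pattern TAG = Pattern.compile("<tag class=\"([^\"]*)\">([^<]*)</tag>");

    // Assumed naming rule (not from the original post): "classA" becomes "tagA".
    static String toTagName(String className) {
        return className.startsWith("class")
                ? "tag" + className.substring("class".length())
                : className;
    }

    static String rewrite(String input) {
        Matcher matcher = TAG.matcher(input);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            StringBuilder open = new StringBuilder();
            StringBuilder close = new StringBuilder();
            for (String cls : matcher.group(1).trim().split("\\s+")) {
                if (cls.isEmpty()) {
                    continue; // empty class attribute: nothing to emit
                }
                String name = toTagName(cls);
                open.append('<').append(name).append('>');
                close.insert(0, "</" + name + ">"); // reverse order keeps tags nested
            }
            String replacement = open + matcher.group(2) + close;
            // quoteReplacement stops '$' and '\' in the text from being read as group references.
            matcher.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        matcher.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String input = "<tag>Testing different formatting options in </tag>"
                + "<tag class=\"classA classB\">Text</tag>"
                + "<tag class=\"classC\">Class C text</tag>";
        System.out.println(rewrite(input));
    }
}
```

Tags without a class attribute do not match the pattern, so they are copied through unchanged. Because of [^<]*, a tag whose content contains further markup is not matched either, which is the well-formedness assumption mentioned above.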

Upvotes: 1

AlexR

Reputation: 115388

You should use a reluctant (non-greedy) quantifier: "<tag class=\"(.*?)\">(.*?)</tag>". Otherwise .* matches any characters, including </tag>.

That said, I agree with the others that parsing XML with regular expressions is not best practice. Use an XML parser instead.
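For comparison, here is a rough sketch of the parser route using the DOM parser that ships with the JDK. It assumes the input is well-formed XML once wrapped in a single root element. The class name `XmlParserTagDemo`, the method `classesAndText`, and the wrapper element `root` are illustrative choices, not part of the original post.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XmlParserTagDemo {
    // Returns "classes|text" for every <tag> element in the fragment.
    static List<String> classesAndText(String fragment) {
        try {
            // An XML document needs exactly one root element, so wrap the fragment.
            String wrapped = "<root>" + fragment + "</root>";
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(wrapped)));

            NodeList tags = doc.getElementsByTagName("tag");
            List<String> results = new ArrayList<>();
            for (int i = 0; i < tags.getLength(); i++) {
                Element tag = (Element) tags.item(i);
                // getAttribute returns an empty string when the attribute is absent.
                results.add(tag.getAttribute("class") + "|" + tag.getTextContent());
            }
            return results;
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String input = "<tag>Testing different formatting options in </tag>"
                + "<tag class=\"classA classB\">Text</tag>"
                + "<tag class=\"classC\">Class C text</tag>";
        System.out.println(classesAndText(input));
    }
}
```

The parser handles attribute quoting, whitespace and nesting itself, so the class list and the text arrive already separated.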

Upvotes: 2

Aaron Digulla

Reputation: 328790

While you could use regexp to locate the start tags and parse the classes, there is no way to produce nested tags as output. See this answer for details.

What you could do is write your own simple HTML parser, but HTML is pretty messy to parse. Or to put it another way: have a look at my reputation and then consider that I wouldn't try it without a really good reason (like someone paying me half a million dollars).

Use a real HTML parser like HTML Tidy instead.

Upvotes: 1
