Extracting Java class names from Javadoc with regex

Question

The problem is as follows: I have a Javadoc-generated HTML file containing Java class names and some additional information, like this:

{@link ml.foo.bar.BazAccEd} (Text) Some text
{@link ml.foo.bar.BazAccGrp} (Text) Some text BazAccGrpList
{@link ml.foo.bar.BazAccEdOrGroup} (Text) Some text {@link.ml.foo.bar.BazAccEdList}

I need to extract from it (using Ant's regex capabilities) only the short names of the Java classes, and only where they appear inside links, replacing the ordinary text between them with commas, so that the sample above would produce

BazAccEd
BazAccGrp
BazAccEdOrGroup, BazAccEdList

It probably isn't anything too complicated, yet I have failed to come up with a regular expression that parses only the links and extracts the correct data from them. Thanks in advance.

alan · Accepted Answer

This should work, given the inputs you provided. It works by capturing the text between a period and a closing curly brace:

\.([A-Za-z\d_]+)(?=})(?:.+\.([A-Za-z\d_]+)(?=}))*

This will return two captured groups, \1 and \2. To get the comma insertion working correctly, check whether \2 matched anything. If it did, insert a comma between \1 and \2.
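As a rough illustration of that check, here is a minimal sketch that applies the pattern to the three sample lines. Python is used only for demonstration, since the pattern relies on constructs (character classes, lookahead, non-capturing groups) that Java's regex engine also supports. The helper name short_names is hypothetical and not part of Ant.

```python
import re

# The pattern from the answer, unchanged.
PATTERN = re.compile(r"\.([A-Za-z\d_]+)(?=})(?:.+\.([A-Za-z\d_]+)(?=}))*")


def short_names(line):
    """Hypothetical helper: short class names from one line, comma-joined."""
    match = PATTERN.search(line)
    if match is None:
        return ""
    first, second = match.group(1), match.group(2)
    # Group 2 is None when the optional part of the pattern did not match,
    # so the comma is only inserted when a second link was found.
    return f"{first}, {second}" if second else first


sample = [
    "{@link ml.foo.bar.BazAccEd} (Text) Some text",
    "{@link ml.foo.bar.BazAccGrp} (Text) Some text BazAccGrpList",
    "{@link ml.foo.bar.BazAccEdOrGroup} (Text) Some text {@link.ml.foo.bar.BazAccEdList}",
]

for line in sample:
    print(short_names(line))
```

In an Ant build the same check would have to be expressed with whatever regex task is in use. This sketch only shows the logic of testing the second group before adding the comma.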

Explanation:

\.([A-Za-z\d_]+)(?=}) # look for a period, characters, and lookahead for closing curly brace. Capture the characters
(?:          # open a non-capturing group
.+           # gobble up characters until ...
\.([A-Za-z\d_]+)(?=}) # ... you find the same thing as in the first line above
)*           # repeat the non-capturing group zero or more times, which makes it optional
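One limitation worth noting: because the greedy .+ runs to the last link on the line, the pattern above yields at most two names (the first link and the last), so a line with three or more links would lose the middle ones. A possible alternative, sketched below in Python purely for illustration, is to match each link separately and join the results. The LINK pattern and the all_short_names helper are assumptions of this sketch, not part of the original answer, and they assume every link contains a package-qualified name.

```python
import re

# Assumed pattern (not from the answer): "{@link", then any characters other
# than a closing brace, then a period and the short name right before "}".
LINK = re.compile(r"\{@link[^}]*?\.([A-Za-z\d_]+)\}")


def all_short_names(line):
    """Hypothetical helper: every linked short class name, comma-joined."""
    # findall returns the single capture group for each link on the line.
    return ", ".join(LINK.findall(line))


print(all_short_names("{@link a.b.One} x {@link a.b.Two} y {@link a.b.Three}"))
```

Plain-text names outside of a link, such as BazAccGrpList in the second sample line, are still ignored because the pattern requires the surrounding {@link ...} braces.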
