okkhoy

Reputation: 1338

Python regex for extracting Java comments

I am parsing Java source code using Python. I need to extract the comment text from the source. I have tried the following.

Take 1:

cmts = re.findall(r'/\*\*(.|[\r\n])*?\*/', lines)

Returns: blanks [' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ']

Take 2: (added a grouping bracket around the regex)

cmts = re.findall(r'(/\*\*(.|[\r\n])*?\*/)', lines)

Returns

Single line comment (example only):

('/**\n\n * initialise the tag with the colors and the tag name\n\n */', ' ')

Multi line comment (example only):

('/**\n\n * Get the color related to a specified tag\n\n * @param tag the tag that we want to get the colour for\n\n * @return color of the tag in String\n\n */', ' ')

I am only interested in the comment text itself, i.e. "initialise the tag with the colors and the tag name" for the first example, or "Get the color related to a specified tag, @param tag the tag that we want to get the colour for, @return color of the tag in String" for the second, and I can't get my head around how to extract it. Please give me some pointers!

Upvotes: 2

Views: 830

Answers (1)

randomir

Reputation: 18697

To extract comments (everything between /** and */), you can use:

re.findall(r'\*\*(.*?)\*\/', text, re.S)

(Note that the capture group can be simplified to (.*?) when re.S/re.DOTALL is used, because the dot then also matches newlines.)
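As for why the attempts in the question returned blanks and tuples: re.findall returns the contents of the capturing groups when the pattern has any, and a repeated group such as (.|[\r\n])*? keeps only its last iteration. Here is a minimal sketch of the difference. The sample string src is made up for illustration, and the last pattern is a variant of the one above that also anchors on the leading slash:

```python
import re

# Hypothetical sample input, not taken from the question's real source file.
src = "/**\n * initialise the tag\n */\nclass Tag {}\n"

# One repeated capturing group: findall returns only that group, and a
# repeated group keeps just its last iteration (the space before "*/").
take1 = re.findall(r'/\*\*(.|[\r\n])*?\*/', src)

# Non-capturing group (?:...) and no other groups: findall returns whole matches.
whole = re.findall(r'/\*\*(?:.|[\r\n])*?\*/', src)

# A single capturing group around the body, with re.S so '.' spans newlines.
body = re.findall(r'/\*\*(.*?)\*/', src, re.S)

print(take1)
print(whole)
print(body)
```

With two groups, as in Take 2, findall returns a tuple per match, which is where the trailing ' ' element comes from.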

Then, for each match, you can collapse every run of spaces and asterisks into a single space, strip the ends, and replace each run of newlines with a comma:

import re

def comments(text):
    for comment in re.findall(r'\*\*(.*?)\*\/', text, re.S):
        # Collapse runs of spaces/asterisks, trim, then turn newline runs into commas.
        yield re.sub('\n+', ',', re.sub(r'[ *]+', ' ', comment).strip())

For example:

>>> list(comments('/**\n\n     * Get the color related to a specified tag\n\n     * @param tag the tag that we want to get the colour for\n\n     * @return color of the tag in String\n\n     */'))
['Get the color related to a specified tag, @param tag the tag that we want to get the colour for, @return color of the tag in String']
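If you would rather keep each comment line separate (and leave any asterisks inside the text alone), a line-by-line variant might look like the sketch below. The name comment_lines is hypothetical, and the code assumes each line is decorated with at most one leading asterisk:

```python
import re

def comment_lines(text):
    """Yield one list of cleaned lines per /** ... */ comment."""
    for body in re.findall(r'/\*\*(.*?)\*/', text, re.S):
        lines = []
        for raw in body.splitlines():
            # Drop indentation and one leading '*' decoration; keep the rest as-is.
            line = re.sub(r'^\s*\*?\s?', '', raw).rstrip()
            if line:
                lines.append(line)
        yield lines

sample = ('/**\n\n     * Get the color related to a specified tag\n\n'
          '     * @param tag the tag that we want to get the colour for\n\n'
          '     * @return color of the tag in String\n\n     */')
result = list(comment_lines(sample))
print(result)
```

Joining each inner list with ', '.join(...) gives the same comma-separated string as above, while the list form makes it easy to pick out the @param and @return lines.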

Upvotes: 2
