Python regex for extracting Java comments

Question

I am parsing Java source code using Python. I need to extract the comment text from the source. I have tried the following.

Take 1:

cmts = re.findall(r'/\*\*(.|[ ])*?\*/', lines)

Returns: blanks [' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ']

Take 2: (added a grouping bracket around the regex)

cmts = re.findall(r'(/\*\*(.|[ ])*?\*/)', lines)

Returns

Single line comment (example only):

('/** * initialise the tag with the colors and the tag name */', ' ')

Multi line comment (example only):

('/** * Get the color related to a specified tag * @param tag the tag that we want to get the colour for * @return color of the tag in String */', ' ')

I am interested only in the comment text itself, i.e. "initialise the tag with the colors and the tag name" or "Get the color related to a specified tag, @param tag the tag that we want to get the colour for, @return color of the tag in String", and I can't get my head around how to extract it. Please give me some pointers!

randomir · Accepted Answer

To extract comments (everything between /** and */), you can use:

re.findall(r'\*\*(.*?)\*/', text, re.S)

(Note how the capture group can be simplified when re.S/re.DOTALL is used, because the dot then also matches newlines.)
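To make the two points above concrete, here is a minimal sketch. It assumes the bracketed class in Take 1 was meant to match a newline, and the sample strings are hypothetical, not taken from the asker's source file. It shows why a repeated capturing group yields only blanks, and what re.S changes.

```python
import re

# Hypothetical one-line sample, not taken from the asker's real source file.
one_line = '/** initialise the tag */'

# findall returns the group, not the whole match, and a repeated capturing
# group keeps only its last repetition: here the space before "*/".
repeated_group = re.findall(r'/\*\*(.|\n)*?\*/', one_line)

# Hypothetical multi-line sample.
multi_line = '/**\n * Get the color\n */'
pattern = r'\*\*(.*?)\*/'

# Without re.S the dot stops at a newline, so nothing matches.
without_dotall = re.findall(pattern, multi_line)

# With re.S (alias re.DOTALL) the dot also matches newlines.
with_dotall = re.findall(pattern, multi_line, re.S)

print(repeated_group)
print(without_dotall)
print(with_dotall)
```

Moving the repetition inside the group, as in (.*?), is what makes findall return the whole comment body instead of its last character.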

Then, for each match, you can collapse runs of spaces and * characters into a single space, and replace the newlines with a comma:

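import re

def comments(text):
    # Capture everything between ** and */; re.S lets the dot span newlines.
    for comment in re.findall(r'\*\*(.*?)\*/', text, re.S):
        # Collapse runs of spaces and asterisks, then turn newlines into commas.
        yield re.sub('\n+', ',', re.sub(r'[ *]+', ' ', comment).strip())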

For example:

>>> list(comments('''/**
...      * Get the color related to a specified tag
...      * @param tag the tag that we want to get the colour for
...      * @return color of the tag in String
...      */'''))
['Get the color related to a specified tag, @param tag the tag that we want to get the colour for, @return color of the tag in String']
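If one list entry per comment line is more convenient than a comma-joined string, a variant along these lines may help. This is only a sketch: the names JAVADOC and javadoc_lines are hypothetical, and the pattern is anchored on the full /** opener, which differs slightly from the pattern used above.

```python
import re

# Anchor on the full "/**" opener so a stray "**" elsewhere is not matched.
# JAVADOC and javadoc_lines are illustrative names, not part of the answer above.
JAVADOC = re.compile(r'/\*\*(.*?)\*/', re.S)

def javadoc_lines(text):
    """Yield one list of cleaned lines per /** ... */ comment in text."""
    for body in JAVADOC.findall(text):
        lines = []
        for raw in body.splitlines():
            # Drop indentation and the leading "*" that decorates each line.
            cleaned = raw.strip().lstrip('*').strip()
            if cleaned:
                lines.append(cleaned)
        yield lines

# Hypothetical sample input.
sample = '/**\n * Get the color\n * @param tag the tag\n */\n/** initialise the tag */'
result = list(javadoc_lines(sample))
print(result)
```

Keeping the lines separate leaves it to the caller to join them with ', ' or to treat the @param and @return tags individually. A regex like this does not understand Java string literals, so text that looks like a comment inside a string would also be picked up.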
