Reputation: 1338
I am parsing Java source code using Python. I need to extract the comment text from the source. I have tried the following.
Take 1:
cmts = re.findall(r'/\*\*(.|[\r\n])*?\*/', lines)
Returns: blanks [' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ']
Take 2: (added a grouping bracket around the regex)
cmts = re.findall(r'(/\*\*(.|[\r\n])*?\*/)', lines)
Returns:
Single line comment (example only):
('/**\n\n * initialise the tag with the colors and the tag name\n\n */', ' ')
Multi line comment (example only):
('/**\n\n * Get the color related to a specified tag\n\n * @param tag the tag that we want to get the colour for\n\n * @return color of the tag in String\n\n */', ' ')
I am interested only in "initialise the tag with the colors and the tag name" or "Get the color related to a specified tag, @param tag the tag that we want to get the colour for, @return color of the tag in String", and I am not able to get my head around it. Please give me some pointers!
Upvotes: 2
Views: 830
Reputation: 18697
To extract comments (everything between /** and */), you can use:
re.findall(r'/\*\*(.*?)\*/', text, re.S)
(Note how the capture group can be simplified when re.S/re.DOTALL is used, since the dot then also matches newlines.)
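As a rough illustration of why the attempts in the question returned blanks: when a pattern contains a capture group, re.findall returns the group's content rather than the whole match, and a repeated group keeps only its last iteration. The sample string below is made up for the demonstration and is not taken from the original source file.

```python
import re

# Hypothetical sample input, invented for this demonstration
sample = "/**\n * initialise the tag\n */\nclass Tag {}"

# A repeated capturing group keeps only its last iteration, so findall
# returns the final character matched before "*/" (here a single space).
take1 = re.findall(r'/\*\*(.|[\r\n])*?\*/', sample)

# With re.S the dot also matches newlines, so one group captures the whole body.
simplified = re.findall(r'/\*\*(.*?)\*/', sample, re.S)

print(take1)
print(simplified)
```

The first call yields only the trailing space of the comment body, while the second yields the full text between the delimiters.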
Then, for each match, you can collapse runs of whitespace and * characters into a single space, and replace each run of \n with ,:
import re

def comments(text):
    for comment in re.findall(r'/\*\*(.*?)\*/', text, re.S):
        # Collapse runs of spaces/asterisks, trim, then turn newline runs into commas
        yield re.sub('\n+', ',', re.sub(r'[ *]+', ' ', comment).strip())
For example:
>>> list(comments('/**\n\n * Get the color related to a specified tag\n\n * @param tag the tag that we want to get the colour for\n\n * @return color of the tag in String\n\n */'))
['Get the color related to a specified tag, @param tag the tag that we want to get the colour for, @return color of the tag in String']
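If a list of lines per comment is more convenient than one comma-joined string, a variation along these lines might work. This is only a sketch: javadoc_lines is a hypothetical helper name, and it assumes that /** and */ never appear inside string literals in the Java source.

```python
import re

def javadoc_lines(text):
    """Yield one list of cleaned lines per /** ... */ block (hypothetical helper)."""
    for body in re.findall(r'/\*\*(.*?)\*/', text, re.S):
        cleaned = []
        for line in body.splitlines():
            # Drop leading whitespace and the decorative leading asterisks
            line = re.sub(r'^\s*\*+\s?', '', line).strip()
            if line:  # skip blank lines
                cleaned.append(line)
        yield cleaned

example = ('/**\n\n * Get the color related to a specified tag\n\n'
           ' * @param tag the tag that we want to get the colour for\n\n'
           ' * @return color of the tag in String\n\n */')
result = list(javadoc_lines(example))
print(result)
```

Keeping the lines separate leaves it to the caller to join them with ', ' or to treat the @param and @return entries differently. Unlike the one-liner above, only the leading asterisks of each line are removed, so a * inside the comment text is preserved.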
Upvotes: 2