Reputation: 18184
I have a file containing Python code (it may not be syntactically correct). It has some functions that are commented out except for the signature. My goal is to detect those empty functions using a regex and clean them up.
Had the comments been only of the # kind, it would have been easier: I could locate functions where every line between two lines starting with def begins with #. The issue is that many functions also contain multi-line comments (actually, docstrings). If you could suggest a way to change multi-line comments to single-line comments, that would help too.
In case you are curious about what this is useful for: it is part of a Python tool with which we are trying to automate some of the steps of code refactoring.
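For the side question of turning docstrings into #-style comments, a rough regex-based sketch (my own illustration, assuming each triple-quoted string opens at the start of a line and contains no nested triple quotes) could look like this:

```python
import re

# one triple-quoted block: opening quotes at the start of a (possibly
# indented) line, lazy content, matching closing quotes ending a line
TRIPLE = re.compile(r'^([ \t]*)("""|\'\'\')(?s:.*?)\2[ \t]*$', re.MULTILINE)

def docstrings_to_comments(source: str) -> str:
    """Rewrite every triple-quoted block as a run of # comment lines."""
    def repl(match):
        indent = match.group(1)
        return "\n".join(indent + "# " + line.strip()
                         for line in match.group(0).splitlines())
    return TRIPLE.sub(repl, source)
```

After such a pass, the simpler "#-only" detection of empty bodies becomes possible again.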
Input:
def this_function_has_stuff(f, g, K):
    """ Thisfunction has stuff in it """
    if f:
        s = 0
    else:
        u =0
    return None

def fuly_commented_fucntion(f, g, K):
    """
    remove this empty function.
    Examples
    ========
    >>> which function is
    >>> empty
    """

def empty_annotated_fn(name: str, result: List[100]) -> List[100]:
    """
    Make some bla.
    Examples
    ========
    >>> bla bla
    >>> bla bla
    x**2 + 1
    """

def note_this_has_one_valid_line(f, K):
    """
    Make some bla.
    Examples
    ========
    >>> bla bla
    >>> bla bla
    x**2 + 1
    """
    return [K.abs(coff) for coff in f]

def empty_with_both_types_of_comment(f, K):
    """
    my bla bla
    Examples
    ========
    3
    """
    # if not f:
    # else:
    #     return max(dup_abs(f, K))

SOME_VAR = 6
Expected output:
def this_function_has_stuff(f, g, K):
    """ Thisfunction has stuff in it """
    if f:
        s = 0
    else:
        u =0
    return None

def note_this_has_one_valid_line(f, K):
    """
    Make some bla.
    Examples
    ========
    >>> bla bla
    >>> bla bla
    x**2 + 1
    """
    return [K.abs(coff) for coff in f]

SOME_VAR = 6
Upvotes: 0
Views: 464
Reputation: 4105
Ok. This is my attempt to use a regex on the python file (e.g. data.py) to produce the expected output. It probably will not cover every conceivable python file, but the proof of concept does a good job with the data provided. The code would need to be updated to accommodate import statements etc.
Here is my code:
import re

# Read the python file to be processed (eg. data.py)
with open("data.py", "r") as f:
    python_file = f.read()

# A function to enumerate an iterator
def enum_iterable(iterator):
    i = 0
    for it in iterator:
        yield (i, it)
        i += 1

# Find all lines that are not within a definition
non_def_pattern = re.compile(r"(\n((?!def)(?!\s))[^\n]+)")
s = non_def_pattern.split(python_file)
str_list = list(filter(None, s))
non_definition_lines = "".join([item for item in str_list if item.startswith('\n')])

# Retain the lines that ARE within a definition
definition_lines = "\n".join([item for item in str_list if not item.startswith('\n')])

# Split the definition lines by definition
def_pattern = re.compile(r'(def[^\n]+\n)')
match = def_pattern.finditer(definition_lines)
def_dict = {}
for m, val in enum_iterable(match):
    def_dict.update({m: val})
split_def_lines = def_pattern.split(definition_lines)

# Remove blank element in first position if it exists
if split_def_lines[0] == '':
    split_def_lines.pop(0)

# Identify blocks that contain code
good_functions = ""
commBlock_pattern = re.compile(r'(\"{3})[^\"]+(\"{3})')
for i, val in enumerate(split_def_lines):
    if i % 2 == 1:
        if '"""' in val:
            if len(commBlock_pattern.findall(val)) > 0:
                result = commBlock_pattern.sub("", val)
                # remove all spaces
                result = result.replace(" ", "")
                # remove lines starting with #
                result = re.sub(r'((\s+)?#[^\n]+\n)', "", result)
                # remove new lines
                result = result.replace("\n", "")
                # If there is any remaining text, add the function to good_functions
                if len(result) > 0:
                    good_functions = good_functions + split_def_lines[i-1] + val

# Now add the non-def lines to the end of good functions
final_output = good_functions + non_definition_lines
print(final_output)
OUTPUT:
def this_function_has_stuff(f, g, K):
    """ Thisfunction has stuff in it """
    if f:
        s = 0
    else:
        u =0
    return None

def note_this_has_one_valid_line(f, K):
    """
    Make some bla.
    Examples
    ========
    >>> bla bla
    >>> bla bla
    x**2 + 1
    """
    return [K.abs(coff) for coff in f]

SOME_VAR = 6
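One concrete case the proof of concept above does not cover (my observation, not part of the original answer): commBlock_pattern uses [^\"]+, so a docstring that itself contains a quote character is never matched, and therefore never stripped:

```python
import re

commBlock_pattern = re.compile(r'(\"{3})[^\"]+(\"{3})')

# a plain docstring is found...
print(commBlock_pattern.search('"""plain docstring"""'))  # a match object
# ...but one containing a " character is not
print(commBlock_pattern.search('"""say "hi" here"""'))    # None
```

A pattern such as """(?:[^"]|"(?!""))*""" would be one way to close that gap.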
Upvotes: 0
Reputation: 8536
I advise you not to try to accomplish this with regex. Python's grammar is not a regular language, and even in your case, where you are only interested in a small subset of the syntax, there are so many possible variations and corner cases that it is just not worth doing this with regex.
Instead, I suggest you explore the awesome ast module, which can parse a source file and let you iterate over the code as a tree. You can then check all function definitions and see whether or not they contain a valid code line.
You can, for example, implement a custom NodeTransformer that removes function definitions that are effectively empty. You'd need to properly define the meaning of "empty", but based on your question, I'd say it is any function whose body contains only docstrings, pass, or ... (Ellipsis).
import ast

class Cleaner(ast.NodeTransformer):
    def __init__(self):
        self.removed = []

    def visit_FunctionDef(self, node):
        for stmt in node.body:
            if isinstance(stmt, ast.Pass):
                continue
            if isinstance(stmt, ast.Expr) and isinstance(stmt.value, ast.Constant):
                const = stmt.value.value
                if isinstance(const, str) or const is Ellipsis:
                    continue
            break
        else:
            # every statement was a docstring, pass or ...: drop the function
            self.removed.append(node.name)
            return None
        return node

    def visit_AsyncFunctionDef(self, node):
        return self.visit_FunctionDef(node)

with open("my/path/to/file.py", "r") as source:
    tree = ast.parse(source.read())

cleaner = Cleaner()
cleaner.visit(tree)
print(cleaner.removed)
# for the sample input: ['fuly_commented_fucntion', 'empty_annotated_fn', 'empty_with_both_types_of_comment']
print(ast.unparse(tree))  # prints your source code without those functions
There are a few limitations to this approach, and you should be aware of them:
- ast does not work on syntactically incorrect source.
- ast.parse ignores and removes comments, so if you unparse the tree, all the comments will be gone.
- As written, the transformer does not descend into nested functions (you could make it do so by calling self.generic_visit(node) inside the visitor methods), but that would raise a question: is a function whose body contains only empty nested functions itself empty?
One thing you can do, instead of unparsing the tree, is to use it only to identify the names of the unimplemented functions, and then use a regular expression to find and remove their definitions (for example, see the answer from @megaultron below).
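That hybrid idea (ast for detection, regex for removal) could be sketched roughly like this; the helper names and the removal pattern are my own assumptions, and the pattern only handles top-level, newline-terminated definitions:

```python
import ast
import re

def find_empty_functions(source: str) -> list:
    """Return names of functions whose body is only docstrings, pass or ..."""
    empty = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            if all(
                isinstance(stmt, ast.Pass)
                or (isinstance(stmt, ast.Expr)
                    and isinstance(stmt.value, ast.Constant)
                    and (isinstance(stmt.value.value, str)
                         or stmt.value.value is Ellipsis))
                for stmt in node.body
            ):
                empty.append(node.name)
    return empty

def remove_functions(source: str, names) -> str:
    """Cut each named def block out of the raw text."""
    for name in names:
        # the 'def name(...):' line plus every following indented or blank line
        pattern = rf"(?m)^def {re.escape(name)}\([^)]*\).*\n(?:[ \t]+.*\n|\n)*"
        source = re.sub(pattern, "", source)
    return source
```

Unlike unparsing, this keeps the comments of the surviving functions intact, because the surviving text is never regenerated.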
Upvotes: 1
Reputation: 419
Use the following regex:
(def (?!fuly_commented_fucntion|empty_with_both_types_of_comment).*(?:\n.+)+)
The (?!...) negative lookahead rejects the named methods, and (?:\n.+)+ consumes the following non-empty lines, i.e. the function body. In the code below, match.group(groupNum) contains each kept function as a string.
The complete code:
import re

# regex
regex = r"(def (?!fuly_commented_fucntion|empty_with_both_types_of_comment).*(?:\n.+)+)"

test_str = ("\n"
            "def this_function_has_stuff(f, g, K):\n"
            "    \"\"\" Thisfunction has stuff in it \"\"\"\n"
            "    if f:\n"
            "        s = 0\n"
            "    else:\n"
            "        u =0\n"
            "    return None\n\n"
            "def fuly_commented_fucntion(f, g, K):\n"
            "    \"\"\"\n"
            "    remove this empty function.\n"
            "    Examples\n"
            "    ========\n"
            "    >>> which function is\n"
            "    >>> empty\n"
            "    \"\"\"\n\n"
            "def note_this_has_one_valid_line(f, K):\n"
            "    \"\"\"\n"
            "    Make some bla.\n"
            "    Examples\n"
            "    ========\n"
            "    >>> bla bla\n"
            "    >>> bla bla\n"
            "    x**2 + 1\n"
            "    \"\"\"\n"
            "    return [K.abs(coff) for coff in f]\n\n"
            "def empty_with_both_types_of_comment(f, K):\n"
            "    \"\"\"\n"
            "    my bla bla\n"
            "    Examples\n"
            "    ========\n"
            "    3\n"
            "    \"\"\"\n"
            "    # if not f:\n"
            "    # else:\n"
            "    #     return max(dup_abs(f, K))\n\n"
            "SOME_VAR = 6")

matches = re.finditer(regex, test_str, re.MULTILINE)

for matchNum, match in enumerate(matches, start=1):
    for groupNum in range(1, len(match.groups()) + 1):
        print('==============your methods=====================')
        print(match.group(groupNum))
Upvotes: -2