Reputation: 18184
I have a file containing Python code (it may not be syntactically correct). It has some functions that are commented out except for the signature. My goal is to detect those empty functions using a regex and clean them up.
Had the comments been only of the # kind, it would have been easier: I could locate functions where every line between two lines starting with def begins with #. The issue is that many functions also contain multi-line comments (actually, docstrings). If you could suggest a way to change multi-line comments to single-line comments, that would help too.
In case you are curious about what this is useful for: it is part of a Python tool with which we are trying to automate some of the steps of code refactoring.
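For the side question of turning docstrings into #-style comments, a rough regex-based sketch (my own illustration, assuming each triple-quoted string opens at the start of a line and contains no nested triple quotes) could look like this:

```python
import re

# one triple-quoted block: opening quotes at the start of a (possibly
# indented) line, lazy content, matching closing quotes ending a line
TRIPLE = re.compile(r'^([ \t]*)("""|\'\'\')(?s:.*?)\2[ \t]*$', re.MULTILINE)

def docstrings_to_comments(source: str) -> str:
    """Rewrite every triple-quoted block as a run of # comment lines."""
    def repl(match):
        indent = match.group(1)
        return "\n".join(indent + "# " + line.strip()
                         for line in match.group(0).splitlines())
    return TRIPLE.sub(repl, source)
```

After such a pass, the simpler "#-only" detection of empty bodies becomes possible again.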
Input:
def this_function_has_stuff(f, g, K):
    """ Thisfunction has stuff in it """
    if f:
        s = 0
    else:
        u =0
    return None

def fuly_commented_fucntion(f, g, K):
    """
    remove this empty function.
    Examples
    ========
    >>> which function is
    >>> empty
    """

def empty_annotated_fn(name: str, result: List[100]) -> List[100]:
    """
    Make some bla.
    Examples
    ========
    >>> bla bla
    >>> bla bla
    x**2 + 1
    """

def note_this_has_one_valid_line(f, K):
    """
    Make some bla.
    Examples
    ========
    >>> bla bla
    >>> bla bla
    x**2 + 1
    """
    return [K.abs(coff) for coff in f]

def empty_with_both_types_of_comment(f, K):
    """
    my bla bla
    Examples
    ========
    3
    """
    # if not f:
    # else:
    #     return max(dup_abs(f, K))

SOME_VAR = 6
Expected output:
def this_function_has_stuff(f, g, K):
    """ Thisfunction has stuff in it """
    if f:
        s = 0
    else:
        u =0
    return None

def note_this_has_one_valid_line(f, K):
    """
    Make some bla.
    Examples
    ========
    >>> bla bla
    >>> bla bla
    x**2 + 1
    """
    return [K.abs(coff) for coff in f]

SOME_VAR = 6
Upvotes: 0
Views: 464
Reputation: 4105
Ok. This is my attempt to use a regex on the python file (e.g. data.py) to produce the expected output. It probably will not cover every conceivable python file, but the proof of concept does a good job with the data provided. The code would need to be updated to accommodate import statements etc.
Here is my code:
import re

# Read the python file to be processed (eg. data.py)
with open("data.py", "r") as f:
    python_file = f.read()

# A function to enumerate an iterator
def enum_iterable(iterator):
    i = 0
    for it in iterator:
        yield (i, it)
        i += 1

# Find all lines that are not within a definition
non_def_pattern = re.compile(r"(\n((?!def)(?!\s))[^\n]+)")
s = non_def_pattern.split(python_file)
str_list = list(filter(None, s))
non_definition_lines = "".join([item for item in str_list if item.startswith('\n')])

# Retain the lines that ARE within a definition
definition_lines = "\n".join([item for item in str_list if not item.startswith('\n')])

# Split the definition lines by definition
def_pattern = re.compile(r'(def[^\n]+\n)')
match = def_pattern.finditer(definition_lines)
def_dict = {}
for m, val in enum_iterable(match):
    def_dict.update({m: val})
split_def_lines = def_pattern.split(definition_lines)

# Remove blank element in first position if it exists
if split_def_lines[0] == '':
    split_def_lines.pop(0)

# Identify blocks that contain code
good_functions = ""
commBlock_pattern = re.compile(r'(\"{3})[^\"]+(\"{3})')
for i, val in enumerate(split_def_lines):
    if i % 2 == 1:
        if '"""' in val:
            if len(commBlock_pattern.findall(val)) > 0:
                result = commBlock_pattern.sub("", val)
                # remove all spaces
                result = result.replace(" ", "")
                # remove lines starting with #
                result = re.sub(r'((\s+)?#[^\n]+\n)', "", result)
                # remove new lines
                result = result.replace("\n", "")
                # If there is any remaining text, add the function to good_functions
                if len(result) > 0:
                    good_functions = good_functions + split_def_lines[i-1] + val

# Now add the non-def lines to the end of good functions
final_output = good_functions + non_definition_lines
print(final_output)
OUTPUT:
def this_function_has_stuff(f, g, K):
    """ Thisfunction has stuff in it """
    if f:
        s = 0
    else:
        u =0
    return None

def note_this_has_one_valid_line(f, K):
    """
    Make some bla.
    Examples
    ========
    >>> bla bla
    >>> bla bla
    x**2 + 1
    """
    return [K.abs(coff) for coff in f]

SOME_VAR = 6
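One concrete case the proof of concept above does not cover (my observation, not part of the original answer): commBlock_pattern uses [^\"]+, so a docstring that itself contains a quote character is never matched, and therefore never stripped:

```python
import re

commBlock_pattern = re.compile(r'(\"{3})[^\"]+(\"{3})')

# a plain docstring is found...
print(commBlock_pattern.search('"""plain docstring"""'))  # a match object
# ...but one containing a " character is not
print(commBlock_pattern.search('"""say "hi" here"""'))    # None
```

A pattern such as """(?:[^"]|"(?!""))*""" would be one way to close that gap.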
Upvotes: 0
Reputation: 8536
I advise you not to try to accomplish this with regex. Python's grammar is not a regular language, and even in your case, where you are only interested in a small subset of the syntax, there are so many possible variations and corner cases that it is just not worth doing this with regex.
Instead, I suggest you explore the awesome ast module, which can parse a source file and let you iterate over the code as a tree. You can then check all function definitions and see whether or not they contain a valid code line.
You can, for example, implement a custom NodeTransformer that removes function definitions that are effectively empty. You'd need to properly define the meaning of "empty", but based on your question, I'd say it is any function whose body contains only docstrings, pass, or ... (Ellipsis).
import ast

class Cleaner(ast.NodeTransformer):
    def __init__(self):
        self.removed = []

    def visit_FunctionDef(self, node):
        for stmt in node.body:
            if isinstance(stmt, ast.Pass):
                continue
            if isinstance(stmt, ast.Expr) and isinstance(stmt.value, ast.Constant):
                const = stmt.value.value
                if isinstance(const, str) or const is Ellipsis:
                    continue
            break
        else:
            # every statement was a docstring, pass or ...: drop the function
            self.removed.append(node.name)
            return None
        return node

    def visit_AsyncFunctionDef(self, node):
        return self.visit_FunctionDef(node)

with open("my/path/to/file.py", "r") as source:
    tree = ast.parse(source.read())

cleaner = Cleaner()
cleaner.visit(tree)
print(cleaner.removed)
# for the sample input: ['fuly_commented_fucntion', 'empty_annotated_fn', 'empty_with_both_types_of_comment']
print(ast.unparse(tree))  # prints your source code without those functions
There are a few limitations to this approach, and you should be aware of them:
- ast does not work on syntactically incorrect source.
- ast.parse ignores and removes comments, so if you unparse the tree, all the comments will be gone.
- As written, the transformer does not descend into nested functions (you could make it do so by calling self.generic_visit(node) inside the visitor methods), but that would raise a question: is a function whose body contains only empty nested functions itself empty?
One thing you can do, instead of unparsing the tree, is to use it only to identify the names of the unimplemented functions, and then use a regular expression to find and remove their definitions (for example, see the answer from @megaultron below).
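That hybrid idea (ast for detection, regex for removal) could be sketched roughly like this; the helper names and the removal pattern are my own assumptions, and the pattern only handles top-level, newline-terminated definitions:

```python
import ast
import re

def find_empty_functions(source: str) -> list:
    """Return names of functions whose body is only docstrings, pass or ..."""
    empty = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            if all(
                isinstance(stmt, ast.Pass)
                or (isinstance(stmt, ast.Expr)
                    and isinstance(stmt.value, ast.Constant)
                    and (isinstance(stmt.value.value, str)
                         or stmt.value.value is Ellipsis))
                for stmt in node.body
            ):
                empty.append(node.name)
    return empty

def remove_functions(source: str, names) -> str:
    """Cut each named def block out of the raw text."""
    for name in names:
        # the 'def name(...):' line plus every following indented or blank line
        pattern = rf"(?m)^def {re.escape(name)}\([^)]*\).*\n(?:[ \t]+.*\n|\n)*"
        source = re.sub(pattern, "", source)
    return source
```

Unlike unparsing, this keeps the comments of the surviving functions intact, because the surviving text is never regenerated.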
Upvotes: 1
Reputation: 419
Use the following regex:
(def (?!fuly_commented_fucntion|empty_with_both_types_of_comment).*(?:\n.+)+)
The (?!...) negative lookahead rejects the named methods, and (?:\n.+)+ consumes the following non-empty lines, i.e. the function body. In the code below, match.group(groupNum) contains each kept function as a string.
The complete code:
import re

# regex
regex = r"(def (?!fuly_commented_fucntion|empty_with_both_types_of_comment).*(?:\n.+)+)"

test_str = ("\n"
            "def this_function_has_stuff(f, g, K):\n"
            "    \"\"\" Thisfunction has stuff in it \"\"\"\n"
            "    if f:\n"
            "        s = 0\n"
            "    else:\n"
            "        u =0\n"
            "    return None\n\n"
            "def fuly_commented_fucntion(f, g, K):\n"
            "    \"\"\"\n"
            "    remove this empty function.\n"
            "    Examples\n"
            "    ========\n"
            "    >>> which function is\n"
            "    >>> empty\n"
            "    \"\"\"\n\n"
            "def note_this_has_one_valid_line(f, K):\n"
            "    \"\"\"\n"
            "    Make some bla.\n"
            "    Examples\n"
            "    ========\n"
            "    >>> bla bla\n"
            "    >>> bla bla\n"
            "    x**2 + 1\n"
            "    \"\"\"\n"
            "    return [K.abs(coff) for coff in f]\n\n"
            "def empty_with_both_types_of_comment(f, K):\n"
            "    \"\"\"\n"
            "    my bla bla\n"
            "    Examples\n"
            "    ========\n"
            "    3\n"
            "    \"\"\"\n"
            "    # if not f:\n"
            "    # else:\n"
            "    #     return max(dup_abs(f, K))\n\n"
            "SOME_VAR = 6")

matches = re.finditer(regex, test_str, re.MULTILINE)

for matchNum, match in enumerate(matches, start=1):
    for groupNum in range(1, len(match.groups()) + 1):
        print('==============your methods=====================')
        print(match.group(groupNum))
Upvotes: -2