salient

Reputation: 2486

Strip multiline python docstrings with regex

I want to strip all Python docstrings out of a file using a simple search and replace. The following (extremely) simplistic regex does the job for one-line docstrings:

Regex101.com

""".*"""

How can I extend that to work with multi-liners?

I tried including \s in a number of places, to no avail.

Upvotes: 2

Views: 2094

Answers (2)

Booboo

Reputation: 44128

Not every multiline string is a docstring. For example, you may have a complicated SQL query that extends across multiple lines, and stripping every triple-quoted string would remove it too. The following instead looks only for triple-quoted strings that appear directly before a class definition or directly after a function definition line.

import re

input_str = """'''
This is a class level docstring
'''
class Article:
    def print_it(self):
        '''
        method level docstring
        '''
        print('Article')
        sql = '''
SELECT * FROM mytable
WHERE DATE(purchased) >= '2020-01-01'
'''
"""
    
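# A triple-quoted string followed (after optional whitespace) by the keyword "class".
doc_reg_1 = r'("""|\'\'\')([\s\S]*?)(\1\s*)(?=class)'
# A "def ...:" line followed by a triple-quoted string; group 1 keeps the def line.
doc_reg_2 = r'(\s+def\s+.*:\s*)\n(\s*"""|\s*\'\'\')([\s\S]*?)(\2[^\n\S]*)'
input_str = re.sub(doc_reg_1, '', input_str)
input_str = re.sub(doc_reg_2, r'\1', input_str)
print(input_str)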

Prints:

class Article:
    def print_it(self):
        print('Article')
        sql = '''
SELECT * FROM mytable
WHERE DATE(purchased) >= '2020-01-01'
'''

Upvotes: 0

Wiktor Stribiżew

Reputation: 626853

As you cannot use an inline s (DOTALL) modifier, the usual workaround to match any character, including line breaks, is a character class that pairs a shorthand character class with its opposite:

"""[\s\S]*?"""

or

"""[\d\D]*?"""

or

"""[\w\W]*?"""

Each of these matches """, then zero or more of any character, as few as possible because *? is a lazy quantifier, and then the closing """.
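To see the lazy pattern at work from Python itself, here is a minimal sketch. The sample `source` string and the variable names are made up for illustration, and the greedy pattern is included only for contrast. It assumes every triple-double-quoted string in the input really is a docstring, which, as the other answer points out, is not always the case.

```python
import re

# Hypothetical sample input: one single-line and one multi-line docstring.
source = '''def f():
    """One-line docstring."""
    return 1

def g():
    """
    Multi-line
    docstring.
    """
    return 2
'''

# [\s\S] matches any character, newlines included; *? stops at the nearest closing """.
lazy = re.compile(r'"""[\s\S]*?"""')
# Greedy variant, for contrast: * runs on to the LAST """ in the input.
greedy = re.compile(r'"""[\s\S]*"""')

stripped = lazy.sub('', source)
print(stripped)

lazy_count = len(lazy.findall(source))      # each docstring is matched separately
greedy_count = len(greedy.findall(source))  # one match spanning both docstrings and the code between them
print(lazy_count, greedy_count)

# Inside Python the DOTALL flag is available, so a plain dot gives the same result.
stripped_dotall = re.sub(r'""".*?"""', '', source, flags=re.DOTALL)
print(stripped == stripped_dotall)
```

The substitution leaves the now-empty, indented lines where the docstrings were. If that matters, the pattern could be extended to consume the surrounding whitespace as well.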

Upvotes: 5
