aria

Reputation: 23

How to get semicolons except in parentheses with regex

For the following C source code piece:

for (j=0; j<len; j++) a = (s) + (4); test = 5;

I want to insert \n after every semicolon ; except those inside parentheses, using a regex in Python.

Now consider this second piece, which has nested parentheses:

for (j=0; j<(len); (j++)) a = (s) + (4); test = 5;

The regex ;(?![^(]*\)) works on the first piece of code but not on this second one.
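
The difference can be reproduced with a short check. This is only a sketch using the standard re module; the names pattern, first and second are made up for the illustration. In the second piece, the next parenthesis after the first semicolon is an opening one, so the lookahead cannot tell that the semicolon is still inside the for (...) header.

```python
import re

# the lookahead-based pattern from the question
pattern = re.compile(r';(?![^(]*\))')

first = 'for (j=0; j<len; j++) a = (s) + (4); test = 5;'
second = 'for (j=0; j<(len); (j++)) a = (s) + (4); test = 5;'

# positions of the semicolons the pattern would replace
first_hits = [m.start() for m in pattern.finditer(first)]
second_hits = [m.start() for m in pattern.finditer(second)]

print(len(first_hits))   # only the two top-level semicolons
print(len(second_hits))  # also the two semicolons inside the for header
```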

Upvotes: 2

Views: 166

Answers (2)

2d4d

Reputation: 33

You need to count opened and closed brackets as you go and insert the newline only when every opened bracket has been closed again, i.e. when the count is zero. This is done in replacement(), which is called for each match of the regex. The regex matches "(" and ")" just for counting, and ";" to either leave it as it is or append a newline.

import re

def replacement(matched_list):
    global bracket_count
    matched_char=matched_list.group(1)
    if "(" in matched_char:
        bracket_count += 1
        # don't replace, just return what was found
        return matched_char 
    elif ")" in matched_char:
        bracket_count -= 1
        # don't replace, just return what was found
        return matched_char 
    elif ";" in matched_char:
        # outside all brackets: end the statement with a newline
        # (the optional space captured after the semicolon is dropped)
        if bracket_count == 0:
            return ';\n'
        # inside brackets: leave the match intact, including the space
        else:
            return matched_char

# example 1
bracket_count=0
code="for (j=0; j<len; j++) a = (s) + (4); test = 5;"
new_code = re.sub(r'([();] ?)', replacement, code)
print(code)
print(new_code)

# example 2
bracket_count=0
code="for (j=0; j<(len); (j++)) a = (s) + (4); test = 5;"
new_code = re.sub(r'([();] ?)', replacement, code)
print(code)
print(new_code)

# example 3
bracket_count=0
code="for (j=0; j<len; j++) test = 5; a = (s) + (4);"
new_code = re.sub(r'([();] ?)', replacement, code)
print(code)
print(new_code)

Result:

for (j=0; j<len; j++) a = (s) + (4); test = 5;
for (j=0; j<len; j++) a = (s) + (4);
test = 5;

for (j=0; j<(len); (j++)) a = (s) + (4); test = 5;
for (j=0; j<(len); (j++)) a = (s) + (4);
test = 5;
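
The global counter has to be reset by hand before every call. One possible variation keeps the counter inside a wrapper function instead. This is a sketch that assumes the same matching rules as above; the helper name split_statements is made up for this example.

```python
import re

def split_statements(code):
    """Insert a newline after every ';' that is outside parentheses."""
    depth = 0  # number of currently open parentheses

    def replacement(match):
        nonlocal depth
        token = match.group(1)
        if token.startswith('('):
            depth += 1
        elif token.startswith(')'):
            depth -= 1
        elif depth == 0:
            # top-level ';': drop the optional trailing space, add a newline
            return ';\n'
        # brackets and semicolons inside brackets are returned unchanged
        return token

    return re.sub(r'([();] ?)', replacement, code)

print(split_statements('for (j=0; j<len; j++) test = 5; a = (s) + (4);'))
```

Because depth starts at zero on every call, the function can be applied to several strings in a row without any reset.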

Upvotes: 2

Jongware

Reputation: 22457

Use a custom replacement function:

re.sub(pattern, repl, string, count=0, flags=0)
...
If repl is a function, it is called for every non-overlapping occurrence of pattern.

The function repl is called for every occurrence of a single ; and for parenthesized expressions. Since re.sub does not find overlapping sequences, the very first opening parenthesis will trigger a full match all the way up to the last closing parenthesis.

import re

def repl(m):
    contents = m.group(1)
    if '(' in contents:
        return contents
    return ';\n'

str1 = 'for (j=0; j<len; j++) a = (s) + (4); test = 5;'
str2 = 'for (j=0; j<(len); (j++)) a = (s) + (4); test = 5;'

print (re.sub (r'(;\s*|\(.*\))', repl, str1))
print (re.sub (r'(;\s*|\(.*\))', repl, str2))

Result:

for (j=0; j<len; j++) a = (s) + (4);
test = 5;

for (j=0; j<(len); (j++)) a = (s) + (4);
test = 5;

Mission accomplished, for your (very little) sample data.

But wait!

A small – but valid – change in one of the examples

str1 = 'for (j=0; j<len; j++) test = 5; a = (s) + (4);'

breaks this with a wrong output:

for (j=0; j<len; j++) test = 5; a = (s) + (4);
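
Listing the matches makes the cause visible. In this small sketch (the variable name matches is only for illustration), the greedy \(.*\) alternative swallows everything from the first ( to the last ), including the top-level semicolon after test = 5, so repl never sees that semicolon on its own:

```python
import re

str1 = 'for (j=0; j<len; j++) test = 5; a = (s) + (4);'

# every substring the pattern matches, in order
matches = re.findall(r'(;\s*|\(.*\))', str1)
for m in matches:
    print(repr(m))
```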

There is no way around it; you need a state machine instead:

def state_match (text):
    parentheses = 0
    drop_space = False
    result = ''
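    for character in text:
        if character == ' ':
            # skip only the single space that directly follows a line-ending ';'
            if not drop_space:
                result += ' '
            drop_space = False
            continue
        # any other character cancels a pending space drop
        drop_space = False
        if character == '(':
            parentheses += 1
            result += '('
        elif character == ')':
            parentheses -= 1
            result += ')'
        elif character == ';':
            if parentheses:
                # inside parentheses: keep the semicolon as it is
                result += character
            else:
                # top level: end the line and drop the next space
                result += ';\n'
                drop_space = True
        else:
            result += character
    return result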

str1 = 'for (j=0; j<len; j++) a = (s) + (4); test = 5;'
str2 = 'for (j=0; j<(len); (j++)) a = (s) + (4); test = 5;'
str3 = 'for (j=0; j<len; j++) test = 5; a = (s) + (4);'

print (state_match(str1))
print (state_match(str2))
print (state_match(str3))

results correctly in:

for (j=0; j<len; j++) a = (s) + (4);
test = 5;

for (j=0; j<(len); (j++)) a = (s) + (4);
test = 5;

for (j=0; j<len; j++) test = 5;
a = (s) + (4);

Upvotes: 1
