Hisham Hijjawi

Reputation: 2425

Split text between two strings, inclusive, using re.split in Python and return a list

I am trying to split a piece of text in a file formatted like this:

module 
some text
endmodule

module 
some other text
endmodule

I want to split between the words module and endmodule while keeping module and endmodule in each output string.

This is not a duplicate of other regex questions because I am trying to use re.split() to return a list, not find.

This is the regex I've tried:

s=file.read()
l=re.split("module(.*)endmodule",s)

but it won't split anything...

Ideally final output would be a list that includes both modules as strings,

['module\n sometext\n endmodule', 'module\n someothertext\n endmodule']

Upvotes: 0

Views: 566

Answers (2)

Julius Vainora

Reputation: 48251

We could use a positive lookbehind and a positive lookahead as in

print(re.split(r'(?<=endmodule)\s*(?=module)', s))

giving

['module\nsome text\nendmodule', 'module\nsome other text\nendmodule']

where

s = ("module\n"
     "some text\n"
     "endmodule\n\n"
     "module\n"
     "some other text\n"
     "endmodule")
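
The same lookaround idea can be wrapped in a small helper. This is only a sketch: the name split_modules is hypothetical, and it assumes that nothing but whitespace (at least one character of it) separates one endmodule from the next module.

```python
import re

def split_modules(text):
    """Split text at the whitespace between 'endmodule' and the next 'module'."""
    # The lookbehind and lookahead assert the keywords without consuming them,
    # so both keywords stay inside the resulting pieces.
    # Assumption: blocks are separated by one or more whitespace characters.
    return re.split(r"(?<=endmodule)\s+(?=module)", text)

s = ("module\n"
     "some text\n"
     "endmodule\n\n"
     "module\n"
     "some other text\n"
     "endmodule")

parts = split_modules(s)
print(parts)
```

Only the separator is consumed by the split, which is why each element still starts with module and ends with endmodule.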

Upvotes: 1

Emma

Reputation: 27763

My guess is that you might want to design an expression similar to:

module(.*?)endmodule

not sure though.
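
To illustrate why the lazy quantifier and re.DOTALL both matter for text like this, here is a small sketch. The sample text and variable names are assumptions for illustration, and the patterns drop the capturing group so that re.findall returns whole matches.

```python
import re

text = ("module\n"
        "some text\n"
        "endmodule\n\n"
        "module\n"
        "some other text\n"
        "endmodule")

# Without re.DOTALL, '.' does not match a newline, so nothing matches.
no_dotall = re.findall(r"module.*?endmodule", text)

# Greedy '.*' with re.DOTALL runs to the last 'endmodule': one big match.
greedy = re.findall(r"module.*endmodule", text, re.DOTALL)

# Lazy '.*?' with re.DOTALL stops at the first 'endmodule' each time.
lazy = re.findall(r"module.*?endmodule", text, re.DOTALL)

print(len(no_dotall), len(greedy), len(lazy))  # prints: 0 1 2
```

The first case is likely also why a pattern such as module(.*)endmodule appears to do nothing on multi-line input: without re.DOTALL it never matches across the line breaks.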

Test with re.finditer

import re

regex = r"module(.*?)endmodule"

test_str = ("module \n"
    "some text\n"
    "endmodule\n\n"
    "module \n"
    "some other text\n"
    "endmodule")

matches = re.finditer(regex, test_str, re.DOTALL)

for matchNum, match in enumerate(matches, start=1):
    # Report the span and text of the whole match.
    print("Match {matchNum} was found at {start}-{end}: {match}".format(
        matchNum=matchNum, start=match.start(), end=match.end(), match=match.group()))

    # Report each capturing group; group numbers start at 1.
    for groupNum in range(1, len(match.groups()) + 1):
        print("Group {groupNum} found at {start}-{end}: {group}".format(
            groupNum=groupNum, start=match.start(groupNum), end=match.end(groupNum),
            group=match.group(groupNum)))

Test with re.findall

import re

regex = r"module(.*?)endmodule"

test_str = ("module \n"
    "some text\n"
    "endmodule\n\n"
    "module \n"
    "some other text\n"
    "endmodule")

print(re.findall(regex, test_str, re.DOTALL))
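
Note that re.findall returns only the contents of a capturing group when the pattern has one, so the module and endmodule keywords are missing from that result. If the full blocks are wanted, as in the question, one possible sketch is to take the whole match instead. The names MODULE_BLOCK and find_modules are hypothetical, and the \b word boundaries are an added assumption to keep module from matching inside endmodule.

```python
import re

# No capturing group; \b keeps 'module' from matching inside 'endmodule'.
MODULE_BLOCK = re.compile(r"\bmodule\b.*?\bendmodule\b", re.DOTALL)

def find_modules(text):
    """Return each 'module ... endmodule' block, keywords included."""
    # group(0) is the entire match, regardless of any groups in the pattern.
    return [m.group(0) for m in MODULE_BLOCK.finditer(text)]

test_str = ("module \n"
    "some text\n"
    "endmodule\n\n"
    "module \n"
    "some other text\n"
    "endmodule")

blocks = find_modules(test_str)
print(blocks)
```

This yields a list of strings like re.split would, without depending on what separates the blocks.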

The expression is explained in the top-right panel of this demo, where you can explore, simplify, or modify it further. In this link, you can watch step by step how it matches against some sample inputs.

Upvotes: 1
