How can I extract the information I want using this RegEx or better?

Question

So here's the Regular Expression I have so far.

r"(?s)(?<=([A-G][1-3])).*?(?=[A-G][1-3]|$)"

It looks behind for a letter A-G followed by a digit 1-3, and looks ahead for the same thing (or the end of the string). I've tested it using Regex101.

This is the string I'm testing it against,

"A1 **ACBFEKJRQ0Z+-** F2 **.,12STLMGHD** F1 **9)(** D2 **!?56WXP** C1 **IONVU43\"\'** E1 **Y87><** A3 **-=.,\'\"!?><()@**"

(the string shouldn't have any spaces but I needed to embolden the values between each Letter followed by a number so it is easier to see what I want)

What I want is to store the text between each pair of group matches (the "Full Matches") together with the group match that precedes it, so I can use both later.

In the end I would like to end up with either a list of tuples or a dictionary, for example:

dct = {"A1": "ACBFEKJRQ0Z+-", "F2": ".,12STLMGHD", "F1": "9)(", "next group match": "characters that follow"}

or

list_of_tuples = [("A1", "ACBFEKJRQ0Z+-"), ("F2", ".,12STLMGHD"), ("F1", "9)("), ("next group match", "characters that follow")]

The string being compared to the RegEx won't ever have something like "C1F2" btw

P.S. Excuse the terrible explanation, any help is greatly appreciated

Wiktor Stribiżew · Accepted Answer

I suggest

(?s)([A-G][1-3])((?:(?![A-G][1-3]).)*)

See the regex demo

The (?s) enables . to match line breaks, ([A-G][1-3]) captures the uppercase letter+digit into Group 1, and ((?:(?![A-G][1-3]).)*) captures into Group 2 all the text that follows, one character at a time, stopping at the first position where another uppercase letter+digit sequence starts.
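To make the three parts easier to see, here is a minimal sketch of the same pattern written with re.VERBOSE so each piece carries its own comment. The short input string is made up for illustration, and the (?s) flag is passed as re.DOTALL instead of inline.

```python
import re

# Same pattern as above, spread out so each part can be commented.
pattern = re.compile(r"""
    ([A-G][1-3])          # Group 1: the label, a letter A-G plus a digit 1-3
    (                     # Group 2: the text that follows the label
      (?:
        (?![A-G][1-3])    # stop if another label would start at this position
        .                 # otherwise consume one character (newlines included)
      )*
    )
""", re.VERBOSE | re.DOTALL)

# Hypothetical input: the text after A1 contains a line break.
m = pattern.match("A1ACB\nF2xy")
print(m.group(1), repr(m.group(2)))
```

The lookahead is checked before every single character, which is what keeps Group 2 from running into the next label.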

The same regex can be unrolled as ([A-G][1-3])([^A-G]*(?:[A-G](?![1-3])[^A-G]*)*) for better performance (no re.DOTALL modifier or (?s) is necessary with it). See this demo.
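As a quick sanity check of that equivalence, here is a minimal sketch that runs both patterns over the sample string from the question (written without spaces, as the asker describes) and compares the results. The variable names are only illustrative.

```python
import re

# Tempered-dot version: needs (?s) so that . also matches line breaks.
tempered = r"(?s)([A-G][1-3])((?:(?![A-G][1-3]).)*)"
# Unrolled version: [^A-G] already matches line breaks, so no (?s) is needed.
unrolled = r"([A-G][1-3])([^A-G]*(?:[A-G](?![1-3])[^A-G]*)*)"

# Sample string from the question, with the spaces removed.
sample = """A1ACBFEKJRQ0Z+-F2.,12STLMGHDF19)(D2!?56WXPC1IONVU43"'E1Y87><A3-=.,'"!?><()@"""

pairs_tempered = re.findall(tempered, sample)
pairs_unrolled = re.findall(unrolled, sample)

print(pairs_tempered == pairs_unrolled)  # both patterns should agree
print(pairs_unrolled[:2])                # first two (label, text) pairs
```

The unrolled form consumes runs of characters outside A-G in one step and only checks the lookahead when it actually meets a letter A-G, instead of checking before every character.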

Python demo:

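import re

# Group 1: the letter+digit label; Group 2: everything up to the next label.
regex = r"(?s)([A-G][1-3])((?:(?![A-G][1-3]).)*)"
test_str = """A1ACBFEKJRQ0Z+-F2.,12STLMGHDF19)(D2!?56WXPC1IONVU43"'E1Y87><A3-=.,'"!?><()@"""
# findall returns (label, text) tuples, which dict() turns into a mapping.
dct = dict(re.findall(regex, test_str))
print(dct)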
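
Since the question also mentions a list of tuples: re.findall already returns one when the pattern has two groups, so the dict(...) call is optional. A minimal sketch, assuming the same pattern; the short input with a repeated A1 label is made up for illustration and shows one reason the list might be preferable.

```python
import re

regex = r"(?s)([A-G][1-3])((?:(?![A-G][1-3]).)*)"

# Hypothetical input in which the label A1 occurs twice.
sample = "A1XYZF2.,12A1!?"

pairs = re.findall(regex, sample)  # list of (label, text) tuples, in order
as_dict = dict(pairs)              # a repeated label keeps only its last text

print(pairs)
print(as_dict)
```

If a label can appear more than once in the input, the dictionary silently keeps only the last value for it, while the list preserves every pair and their order.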
