Kaiser

Reputation: 197

Finding the text between a tag and any tag from a list using regex

I want to find the text between two different tags. The catch is that the first tag is fixed, but the second can be any tag from a particular list.

For example, given the string

'TRSF BOOK TRANSFER CREDIT SND= abcd bank , 123  ORG= qwer123 OGB= qwerasd OBI= 123433'

I have a list of tags ['TRSF','SND=','ORG=','OGB=','OBI=']

Edit: the tags in the list now include the '=' where it appears in the string.

My output should look something like this:

TRSF : BOOK TRANSFER CREDIT 
SND : abcd bank , 123
ORG : qwer123
OGB : qwerasd
OBI : 123433

The order of the tags and which tags are present may change, and new tags may be added.

Until now I have been writing a separate regex and string-parsing code for each tag, but that seems impractical because the number of combinations is unbounded.

Here is what I was doing:

import re

org = re.findall(r"ORG=(.*?) OGB=", string_1)
snd = re.findall(r"SND=(.*?) ORG=", string_1)
_, _, obi = string_1.partition('OBI=')  # keep only the text after 'OBI='

Is there a way to do something like

<tag>(.*?)<tag in list>

or is there another method?

Upvotes: 1

Views: 63

Answers (1)

Wiktor Stribiżew

Reputation: 627488

If the tag list is complete, you can use a regex like

\b(TRSF|SND|ORG|OGB|OBI)\b=?\s*(.*?)(?=\s*\b(?:TRSF|SND|ORG|OGB|OBI)\b|\Z)

See the regex demo. Details:

  • \b - a word boundary
  • (TRSF|SND|ORG|OGB|OBI) - a tag captured into Group 1
  • \b - a word boundary
  • =? - an optional =
  • \s* - zero or more whitespace characters
  • (.*?) - Group 2: any zero or more chars, as few as possible
  • (?=\s*\b(?:TRSF|SND|ORG|OGB|OBI)\b|\Z) - a positive lookahead that requires either zero or more whitespace characters followed by a tag as a whole word, or the end of the string (\Z).
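
To see why the lookahead matters, here is a small sketch on a shortened string with only two tags (the shortened string is made up for illustration). Without the lookahead, the lazy group has no reason to consume anything and comes back empty.

```python
import re

s = 'SND= abcd bank ORG= qwer123'

# Without a lookahead, the lazy group is satisfied immediately and captures nothing.
no_lookahead = re.findall(r'\b(SND|ORG)\b=?\s*(.*?)', s)

# With the lookahead, the lazy group has to grow until the next tag or the end of the string.
with_lookahead = re.findall(r'\b(SND|ORG)\b=?\s*(.*?)(?=\s*\b(?:SND|ORG)\b|\Z)', s)

print(no_lookahead)    # [('SND', ''), ('ORG', '')]
print(with_lookahead)  # [('SND', 'abcd bank'), ('ORG', 'qwer123')]
```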

See the Python demo:

import re

s = 'TRSF BOOK TRANSFER CREDIT SND= abcd bank , 123  ORG= qwer123 OGB= qwerasd OBI= 123433'
tags = ['TRSF', 'SND', 'ORG', 'OGB', 'OBI']
alt = "|".join(tags)  # alternation of all known tags
pattern = fr'\b({alt})\b=?\s*(.*?)(?=\s*\b(?:{alt})\b|\Z)'
print(dict(re.findall(pattern, s.strip(), re.DOTALL)))
# => {'TRSF': 'BOOK TRANSFER CREDIT', 'SND': 'abcd bank , 123', 'ORG': 'qwer123', 'OGB': 'qwerasd', 'OBI': '123433'}

Note that re.DOTALL (equivalent to re.S) makes . match any character, including line break characters.
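
Since the tag list in the question includes the '=' and may grow over time, the same idea could be wrapped in a small helper. The sketch below is one possible way to do it, not part of the original demo: parse_tagged is a hypothetical name, it strips a trailing '=' from each tag and passes the tags through re.escape, and it assumes every tag starts and ends with a word character so that \b still applies.

```python
import re

def parse_tagged(text, tags):
    """Map each tag found in text to the text that follows it (hypothetical helper)."""
    # Accept tags written with or without a trailing '=' and escape them for regex use.
    names = [re.escape(t.rstrip('=')) for t in tags]
    # Longest first, so a tag that is a prefix of another tag cannot shadow it.
    alt = "|".join(sorted(names, key=len, reverse=True))
    pattern = re.compile(
        rf'\b({alt})\b=?\s*(.*?)(?=\s*\b(?:{alt})\b|\Z)',
        re.DOTALL,
    )
    return dict(pattern.findall(text.strip()))

s = 'TRSF BOOK TRANSFER CREDIT SND= abcd bank , 123  ORG= qwer123 OGB= qwerasd OBI= 123433'
result = parse_tagged(s, ['TRSF', 'SND=', 'ORG=', 'OGB=', 'OBI='])
print(result)
# => {'TRSF': 'BOOK TRANSFER CREDIT', 'SND': 'abcd bank , 123', 'ORG': 'qwer123', 'OGB': 'qwerasd', 'OBI': '123433'}
```

Because the pattern is built from the list, a change in tag order or a missing tag needs no code change, and a new tag only has to be appended to the list. A tag that appears twice in the same string would keep only its last value, since the result is a dict.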

Upvotes: 1
