art8080

Reputation: 21

Regular expression to capture different lines

I'm trying to find a better way to capture variable values from a file that stores some information, but I'm running into problems with line breaks and spaces. For example, the DataSetList variable can store its value in two different ways:

Input

Templates = <
  item
    Name = 'fruits'
    TemplateList = '7,12'
  end>
Surveys = <
  item
    ID = 542
    Name = 'apple'
  end
  item
    ID = 872
    Name = 'banana'
    DataSetList = '873,887,971,1055'
    PluginInfo = {something}
  end
  item
    ID = 437
    Name = 'cherry'
    DataSetList = 
      '438,452,536,620,704,788,1143,1179,1563,1647,1731,1839,1875,1851,' +
      '1863,2060,2359,2443,2469,2620'
    PluginInfo = {something}
  end>

The only way I've found to capture the values of the ID, Name and DataSetList variables stored in an 'item end' block is the following (my approach):

Expression

ID[\s\=]*(?P<UID>\d*)\s*Name[\s\=]*'(?P<Name>.*)'\s*DataSetList[\s\=]*(?P<DataSetList>(?:'[\d\,]*'[\s\+]*)*)
ID[\s\=]*(?P<UID>\d*)                                    # capture ID
\s*                                                      # match spaces 
Name[\s\=]*'(?P<Name>.*)'                                # capture Name
\s*                                                      # match spaces
DataSetList[\s\=]*(?P<DataSetList>(?:'[\d\,]*'[\s\+]*)*) # capture DataSetList

My approach output

{'UID': '872',
 'Name': 'banana',
 'DataSetList': "'873,887,971,1055'\n    "}

{'UID': '437',
 'Name': 'cherry',
 'DataSetList': "'438,452,536,620,704,788,1143,1179,1563,1647,1731,1839,1875,1851,' +\n      '1863,2060,2359,2443,2469,2620'\n    "}

Problem

I don't think my approach is good, because the named capturing group DataSetList also captures spaces, line breaks and the literal +, so the values require post-processing.

Any approaches or ideas to improve my regular expression would be very helpful. Unfortunately, my knowledge of regex isn't as deep as I would like it to be. It would be very interesting to see how this can be done in other ways.

Upvotes: 2

Views: 71

Answers (1)

LeopardShark

Reputation: 4416

You can improve the regex a bit.

ID[\s=]*(?P<UID>\d*)\s*Name[\s=]*'(?P<Name>.*)'\s*DataSetList[\s=]*(?P<DataSetList>'(?:[\d,]|'[\s+]*')*')

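This gets rid of the unnecessary escapes for =, , and + (none of them needs escaping inside a character class). The DataSetList group now starts and ends on a quote, so it no longer matches the whitespace after the final part of the value.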
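
As a quick check, here is a minimal sketch of how the expression could be used with Python's re module. The sample text is a shortened excerpt of the input in the question, and the names PATTERN, sample and matches are only illustrative.

```python
import re

# The expression from this answer, split across lines for readability.
PATTERN = re.compile(
    r"ID[\s=]*(?P<UID>\d*)\s*"
    r"Name[\s=]*'(?P<Name>.*)'\s*"
    r"DataSetList[\s=]*(?P<DataSetList>'(?:[\d,]|'[\s+]*')*')"
)

# Shortened excerpt of the input in the question (assumed layout).
sample = """\
Surveys = <
  item
    ID = 542
    Name = 'apple'
  end
  item
    ID = 872
    Name = 'banana'
    DataSetList = '873,887,971,1055'
    PluginInfo = {something}
  end
  item
    ID = 437
    Name = 'cherry'
    DataSetList =
      '438,452,' +
      '1863,2060'
    PluginInfo = {something}
  end>
"""

# One dict per item that has ID, Name and DataSetList in that order.
matches = [m.groupdict() for m in PATTERN.finditer(sample)]
for match in matches:
    print(match)
```

The 'apple' item is skipped because it has no DataSetList, which is the same behaviour as the original expression. The multi-line value still contains the quotes, the + and the line break between its two parts.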

I can't see a nice way to avoid having to post-process the DataSetList, if you stick to regular expressions.
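
The post-processing itself can be short, though. A minimal sketch, assuming the IDs are always plain non-negative integers (dataset_ids is a hypothetical helper name, not something from the question):

```python
import re


def dataset_ids(raw):
    """Turn a captured DataSetList value into a list of integers."""
    # The capture may contain quotes, '+' and line breaks; only the
    # digit runs carry information, so pull those out directly.
    return [int(number) for number in re.findall(r"\d+", raw)]


# A multi-line capture in the shape the regex produces.
raw = "'438,452,' +\n      '1863,2060'"
print(dataset_ids(raw))
```

Because it only looks for runs of digits, the same helper works on both the single-line and the multi-line form of the value.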

If you need to do anything more complicated with this, I'd advise moving away from regexes. They are great for simple things, but it looks like in this case you'd be better off with a proper parser. If none already exists for the language you have here, you can use a parsing library such as Lark to create one without too much difficulty.
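
To give an idea of what the parser route looks like, here is a rough sketch. It does not use Lark; it is a small hand-written tokenizer and recursive-descent parser built on the standard library only. The grammar is guessed from the sample in the question (assignments, quoted strings joined by +, integers, braced values, and <item ... end> collections), so treat the token set and the names tokenize and parse as assumptions.

```python
import re

# Token kinds guessed from the sample input in the question.
TOKEN_RE = re.compile(r"""
    (?P<STRING>'[^']*')       # single-quoted string
  | (?P<NUMBER>\d+)           # integer
  | (?P<BRACED>\{[^}]*\})     # braced value, kept as opaque text
  | (?P<NAME>[A-Za-z_]\w*)    # identifier, or the keywords item / end
  | (?P<OP>[=<>+])            # punctuation
  | (?P<SKIP>\s+)             # whitespace, including line breaks
""", re.VERBOSE)


def tokenize(text):
    """Split text into (kind, value) pairs, dropping whitespace."""
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character at offset {pos}")
        if m.lastgroup != "SKIP":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens


def parse(text):
    """Parse 'Name = value' assignments into nested dicts and lists."""
    tokens, i = tokenize(text), 0

    def peek():
        return tokens[i] if i < len(tokens) else (None, None)

    def take(kind=None, value=None):
        nonlocal i
        tok = peek()
        if (kind and tok[0] != kind) or (value and tok[1] != value):
            raise ValueError(f"unexpected token {tok!r}")
        i += 1
        return tok[1]

    def parse_fields(stop=None):
        fields = {}
        while peek()[0] == "NAME" and peek()[1] != stop:
            name = take("NAME")
            take("OP", "=")
            fields[name] = parse_value()
        return fields

    def parse_value():
        kind, value = peek()
        if kind == "STRING":
            # Adjacent strings joined by '+' form one logical value.
            parts = [take()[1:-1]]
            while peek() == ("OP", "+"):
                take()
                parts.append(take("STRING")[1:-1])
            return "".join(parts)
        if kind == "NUMBER":
            return int(take())
        if kind == "BRACED":
            return take()
        if (kind, value) == ("OP", "<"):
            take()
            items = []
            while peek() == ("NAME", "item"):
                take()
                items.append(parse_fields(stop="end"))
                take("NAME", "end")
            take("OP", ">")
            return items
        raise ValueError(f"unexpected token {(kind, value)!r}")

    result = parse_fields()
    if i != len(tokens):
        raise ValueError(f"unexpected token {peek()!r}")
    return result


# Shortened excerpt of the input in the question (assumed layout).
sample = """\
Surveys = <
  item
    ID = 872
    Name = 'banana'
    DataSetList = '873,887,971,1055'
    PluginInfo = {something}
  end
  item
    ID = 437
    Name = 'cherry'
    DataSetList =
      '438,452,' +
      '1863,2060'
  end>
"""

parsed = parse(sample)
for item in parsed["Surveys"]:
    print(item["ID"], item["Name"], item["DataSetList"])
```

With this approach the line breaks and the + between string parts are handled while tokenizing and parsing, so DataSetList comes out as one clean string and items without a DataSetList are still returned.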

Upvotes: 1
