Reputation: 21
I'm trying to find a better way to capture variable values from a file that stores some information, but I'm running into problems with line breaks and spaces. For example, the DataSetList variable below stores its value in two different ways:
Templates = <
item
Name = 'fruits'
TemplateList = '7,12'
end>
Surveys = <
item
ID = 542
Name = 'apple'
end
item
ID = 872
Name = 'banana'
DataSetList = '873,887,971,1055'
PluginInfo = {something}
end
item
ID = 437
Name = 'cherry'
DataSetList =
'438,452,536,620,704,788,1143,1179,1563,1647,1731,1839,1875,1851,' +
'1863,2060,2359,2443,2469,2620'
PluginInfo = {something}
end>
The only way I've found to capture the values of the ID, Name, and DataSetList variables stored in an 'item ... end' block is this (my approach):
ID[\s\=]*(?P<UID>\d*)\s*Name[\s\=]*'(?P<Name>.*)'\s*DataSetList[\s\=]*(?P<DataSetList>(?:'[\d\,]*'[\s\+]*)*)
ID[\s\=]*(?P<UID>\d*) # capture ID
\s* # match spaces
Name[\s\=]*'(?P<Name>.*)' # capture Name
\s* # match spaces
DataSetList[\s\=]*(?P<DataSetList>(?:'[\d\,]*'[\s\+]*)*) # capture DataSetList
{'UID': '872',
'Name': 'banana',
'DataSetList': "'873,887,971,1055'\n "}
{'UID': '437',
'Name': 'cherry',
'DataSetList': "'438,452,536,620,704,788,1143,1179,1563,1647,1731,1839,1875,1851,' +\n '1863,2060,2359,2443,2469,2620'\n "}
I don't think my approach is good, because the named capturing group DataSetList also captures spaces, line breaks, and the literal +, so the values require post-processing.
Any approaches or ideas to improve my regular expression would be very helpful. Unfortunately, my knowledge of regex isn't as deep as I would like it to be. It would be very interesting to see how it's done in other ways.
Upvotes: 2
Views: 71
Reputation: 4416
You can improve the regex a bit.
ID[\s=]*(?P<UID>\d*)\s*Name[\s=]*'(?P<Name>.*)'\s*DataSetList[\s=]*(?P<DataSetList>'(?:[\d,]|'[\s+]*')*')
This gets rid of the unnecessary = and , escapes. The last part now won't match the whitespace after the final bit of the DataSetList.
I can't see a nice way to avoid having to post-process the DataSetList, if you stick to regular expressions.
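For what it's worth, the post-processing can be kept quite small. The following is only a sketch in Python: it assumes the input looks like the sample in the question (abbreviated here as SAMPLE), and the names SAMPLE, ITEM_RE and parse_items are made up for illustration. It applies the regex above in verbose mode and then pulls the numbers out of the captured DataSetList with a second, simpler pattern.

```python
import re

# Abbreviated copy of the sample from the question (an assumption about the format).
SAMPLE = """\
Surveys = <
  item
    ID = 542
    Name = 'apple'
  end
  item
    ID = 872
    Name = 'banana'
    DataSetList = '873,887,971,1055'
    PluginInfo = {something}
  end
  item
    ID = 437
    Name = 'cherry'
    DataSetList =
      '438,452,536,' +
      '1863,2060'
    PluginInfo = {something}
  end>
"""

# Same pattern as above, split over lines with re.VERBOSE for readability.
ITEM_RE = re.compile(
    r"""ID[\s=]*(?P<UID>\d*)\s*
        Name[\s=]*'(?P<Name>.*)'\s*
        DataSetList[\s=]*(?P<DataSetList>'(?:[\d,]|'[\s+]*')*')""",
    re.VERBOSE,
)


def parse_items(text):
    """Return one dict per item that has ID, Name and DataSetList."""
    results = []
    for m in ITEM_RE.finditer(text):
        results.append({
            "UID": m["UID"],
            "Name": m["Name"],
            # Post-processing: keep only the runs of digits, dropping
            # quotes, commas, '+' and line breaks in one step.
            "DataSetList": [int(n) for n in re.findall(r"\d+", m["DataSetList"])],
        })
    return results


for item in parse_items(SAMPLE):
    print(item)
```

Items without a DataSetList (like 'apple') are simply skipped by this pattern, which may or may not be what you want.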
If you need to do anything more complicated with this, I'd advise moving away from regexes. They are great for simple things, but it looks like in this case you'd be better off with a proper parser. If none already exists for the language you have here, you can use a parsing library such as Lark to create one without too much difficulty.
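To give a rough idea of what moving away from a single regex could look like, here is a small hand-written, line-based sketch in Python. It is not Lark and not a full parser for this format: it assumes items are not nested, that every key sits on its own line, and that string values continue on following lines starting with a quote, as in the question's sample. The names SAMPLE, unquote and parse_item_blocks are hypothetical.

```python
import re

# Abbreviated copy of the sample from the question (an assumption about the format).
SAMPLE = """\
Surveys = <
  item
    ID = 542
    Name = 'apple'
  end
  item
    ID = 437
    Name = 'cherry'
    DataSetList =
      '438,452,536,' +
      '1863,2060'
    PluginInfo = {something}
  end>
"""


def unquote(value):
    """Join the quoted pieces of a value such as "'a,' + 'b'" into "a,b"."""
    if value.startswith("'"):
        return "".join(re.findall(r"'([^']*)'", value))
    return value  # numbers and {...} values are left as raw text


def parse_item_blocks(text):
    """Collect the key/value pairs of every flat `item ... end` block."""
    items, current, key = [], None, None
    for raw in text.splitlines():
        line = raw.strip()
        if line == "item":
            current, key = {}, None
        elif line in ("end", "end>"):
            if current is not None:
                items.append({k: unquote(v) for k, v in current.items()})
            current, key = None, None
        elif current is None:
            continue  # outside any item, e.g. "Surveys = <"
        elif "=" in line and not line.startswith("'"):
            # A new "Key = value" line; the value may be empty and
            # continue on the following lines.
            key, _, value = (part.strip() for part in line.partition("="))
            current[key] = value
        elif key is not None:
            current[key] += line  # continuation of the previous value
    return items


for item in parse_item_blocks(SAMPLE):
    print(item)
```

Unlike the regex, this keeps every key of every item (including items without a DataSetList), at the cost of a few more lines. A grammar-based parser would be the more robust choice if the format has nesting or other value types.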
Upvotes: 1