Reputation: 43
I am struggling to parse free-text rules like the examples below, because their wording varies a lot. Ideally I would like to do this in Python, but any language would work.
Example strings:
"if magic code is 543, and type is 5642, 912342, or 7425, type has to have a period. EX: 02-15-99"
"If Magic Code is 722, 43, or 643256 and types is 43234, 5356, and 2112, type has to start with period."
"if magic code is 4542 it is not valid in type."
"if magic code is 532 and date is within 10 years from current data, and the type is 43, the type must begin with law number."
Results that I would like:
[543] [5642, 912342, 7425][type has to have a period.]
[722, 43, 643256][43234, 5356, 2112][type has to start with period.]
[4542][it is not valid in type.]
[532][43][the type must begin with law number.]
There are other variations, but you see the concept. Sorry I am not very good with regular expressions.
Upvotes: 1
Views: 767
Reputation: 5213
Here's a solution with a single regular expression plus some cleaning up after the fact. This works for all your examples, but as stated in the comments, if your sentences vary much more than this you should explore options other than regex.
import re
sentences = ["if magic code is 543, and type is 5642, 912342, or 7425, type has to have a period. EX: 02-15-99",
             "If Magic Code is 722, 43, or 643256 and types is 43234, 5356, and 2112, type has to start with period.",
             "if magic code is 4542 it is not valid in type.",
             "if magic code is 532 and date is within 10 years from current data, and the type is 43, the type must begin with law number."]
# Raw strings keep \s and \d from being read as (invalid) string escapes.
pat = r'(?i)^if\smagic\scode\sis\s(\d+(?:,?\s(?:\d+|or))*)(?:.*types?\sis\s(\d+(?:,?\s(?:\d+|or|and))*,)(.*\.)|(.*\.))'
find_ints = lambda s: [int(d) for d in re.findall(r'\d+', s)]
# Keep only the groups that matched; the last one is always the rule text.
matches = [[g for g in re.match(pat, s).groups() if g] for s in sentences]
# Turn every group except the last into a list of ints, and strip the rule text.
results = [[find_ints(m) for m in match[:-1]] + [[match[-1].strip()]] for match in matches]
And if you need things printed nicely like in your example:
for r in results:
    print(*r, sep='')
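One caveat with the list comprehension above: `re.match` returns `None` for any sentence the pattern does not cover, so a single unexpected line raises `AttributeError`. Here is a minimal sketch of a guarded variant, assuming you would rather get `None` back for unparsed sentences and deal with them separately. The names `PAT` and `parse_rule` are made up for illustration.

```python
import re

# Same pattern as above, split over two raw strings for readability.
PAT = re.compile(
    r'(?i)^if\smagic\scode\sis\s(\d+(?:,?\s(?:\d+|or))*)'
    r'(?:.*types?\sis\s(\d+(?:,?\s(?:\d+|or|and))*,)(.*\.)|(.*\.))'
)

def find_ints(s):
    """Return every run of digits in s as an int."""
    return [int(d) for d in re.findall(r'\d+', s)]

def parse_rule(sentence):
    """Hypothetical helper: parsed groups, or None if the pattern does not match."""
    m = PAT.match(sentence)
    if m is None:
        return None  # caller decides what to do with unparsed input
    groups = [g for g in m.groups() if g]
    # All groups but the last are integer lists; the last is the rule text.
    return [find_ints(g) for g in groups[:-1]] + [[groups[-1].strip()]]
```

Sentences for which `parse_rule` returns `None` can then be logged or reviewed by hand instead of crashing the whole run.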
Upvotes: 0
Reputation: 42421
Well ... this does what you asked. But it's very ugly and quite specific to the examples you've provided. I suspect it will fail against the real data file.
When faced with this kind of parsing job, one way to approach the problem is to run the input data through some preliminary cleanups, simplifying and rationalizing the text where possible. For example, handling the different flavors of lists-of-integers is annoying and makes the regexes more complex. If you could remove the needless commas between integers and drop the terminal "or"/"and", the regexes could be much simpler. Once that kind of cleanup is done, you can sometimes apply one or more regexes to extract the needed bits. The outliers that still fail to match the main regexes can often be handled with specific lookups or hard-coded special-case rules.
import re
lines = [
"if magic code is 543, and type is 5642, 912342, or 7425, type has to have a period. EX: 02-15-99",
"If Magic Code is 722, 43, or 643256 and types is 43234, 5356, and 2112, type has to start with period.",
"if magic code is 4542 it is not valid in type.",
"if magic code is 532 and date is within 10 years from current data, and the type is 43, the type must begin with law number.",
]
mcs_rgx = re.compile(r'magic code is (\d+ (or|and) \d+|\d+(, \d+)*,? (or|and) \d+|\d+)', re.IGNORECASE)
types_rgx = re.compile(r'types? is (\d+ (or|and) \d+|\d+(, \d+)*,? (or|and) \d+|\d+)', re.IGNORECASE)
rest_rgx1 = re.compile(r'(type (has|must).+)')
rest_rgx2 = re.compile(r'.+\d(.+)')
nums_rgx = re.compile(r'\d+')
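for line in lines:
    # Magic codes: pull every integer out of the matched list text.
    m = mcs_rgx.search(line)
    if m:
        mcs_text = m.group(1)
        mcs = [int(n) for n in nums_rgx.findall(mcs_text)]
    else:
        mcs = []
    # Types: same approach as the magic codes.
    m = types_rgx.search(line)
    if m:
        types_text = m.group(1)
        types = [int(n) for n in nums_rgx.findall(types_text)]
    else:
        types = []
    # Rule text: try "type has/must ..." first, then fall back to
    # whatever follows the last digit on the line.
    m = rest_rgx1.search(line)
    if m:
        rest = [m.group(1)]
    else:
        m = rest_rgx2.search(line)
        if m:
            rest = [m.group(1)]
        else:
            rest = ['']
    print(mcs, types, rest)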
Output:
[543] [5642, 912342, 7425] ['type has to have a period. EX: 02-15-99']
[722, 43, 643256] [43234, 5356, 2112] ['type has to start with period.']
[4542] [] [' it is not valid in type.']
[532] [43] ['type must begin with law number.']
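The preliminary cleanup described above could be sketched roughly as follows. This is only an illustration, under the assumption that a comma, "or", or "and" sitting directly between two integers is always list punctuation. The names `normalize`, `extract`, `CODES_RGX`, and `TYPES_RGX` are made up for this example.

```python
import re

# Hypothetical cleanup step: collapse the separators between two integers
# (", ", ", or ", ", and ", " or ", " and ") into a single space.
SEP_RGX = re.compile(
    r'(?<=\d)(?:,\s*(?:(?:or|and)\s+)?|\s+(?:or|and)\s+)(?=\d)',
    re.IGNORECASE,
)

def normalize(line):
    """Turn '5642, 912342, or 7425' into '5642 912342 7425'."""
    return SEP_RGX.sub(' ', line)

# After cleanup, an integer list is just digits separated by single spaces.
CODES_RGX = re.compile(r'magic code is (\d+(?: \d+)*)', re.IGNORECASE)
TYPES_RGX = re.compile(r'types? is (\d+(?: \d+)*)', re.IGNORECASE)

def extract(line):
    """Return (magic_codes, types) as two lists of ints."""
    clean = normalize(line)
    codes = CODES_RGX.search(clean)
    types = TYPES_RGX.search(clean)
    return (
        [int(n) for n in codes.group(1).split()] if codes else [],
        [int(n) for n in types.group(1).split()] if types else [],
    )
```

With the lists normalized first, the two extraction regexes no longer need separate alternatives for each list flavor. The rule text would still need its own handling, as in the loop above.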
Upvotes: 1