Reputation:
I have these measurements in the document
5.3 x 2.5 cm
11 x 11 mm
7 mm
13 x 12 x 14 mm
13x12cm
I need to extract 5.3 x 2.5 cm using a regex in Python.
My code so far is below, but it does not work properly:
import re

x = r"\.\d{1,2}|\d{1,4}\.?\d{0,2}|\d{5}\.?\d?|\d{6}\.?"
by = r"( )?(by|x)( )?"
cm = r"(mm|cm|millimeter|centimeter|millimeters|centimeters)"
x_cm = "((" + x + r" *(to|\-) *" + cm + ")" + "|(" + x + cm + "))"
xy_cm = "((" + x + cm + by + x + cm + ")" +"|(" + x + by + x + cm + ")" +"|(" + x + by + x + "))"
xyz_cm = "((" + x + cm + by + x + cm + by + x + cm + ")" + "|(" + x + by + x + by + x + cm + ")" + "|(" + x + by + x + by + x + "))"
m = "((" + xyz_cm + ")" + "|(" + xy_cm + ")" + "|(" + x_cm + "))"
a = re.compile(m)
print(a.findall(text))  # text holds the document contents shown above
The output it gives:
[('13', '13', '13', '13', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', ''), ('12', '12', '12', '12', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', ''), ('4', '4', '4', '4', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', ''), ('25', '25', '25', '25', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', ''),
Upvotes: 7
Views: 3721
Reputation: 627126
There are only two issues with the current regex.
First, the pattern is full of capturing groups, so .findall extracts all the captured substrings rather than the whole match value (this is not crucial, though: you might as well use re.finditer and get match.group(0)).
Second, in the x pattern, the number-format alternation is not grouped, which ruins the structure of the final pattern as soon as other parts are concatenated to it.
A quick fix will look like
import re

# Every group is non-capturing, so findall returns whole matches.
# The number alternation in x is wrapped in a group so it can be concatenated safely.
x = r"(?:\.\d{1,2}|\d{1,4}\.?\d{0,2}|\d{5}\.?\d?|\d{6}\.?)"
by = r"(?: )?(?:by|x)(?: )?"
cm = r"(?:mm|cm|millimeter|centimeter|millimeters|centimeters)"
x_cm = "(?:" + x + r" *(?:to|\-) *" + cm + "|" + x + cm + ")"
xy_cm = "(?:" + x + cm + by + x + cm +"|" + x + by + x + cm +"|" + x + cm + by + x +"|" + x + by + x + ")"
xyz_cm = "(?:" + x + cm + by + x + cm + by + x + cm + "|" + x + by + x + by + x + cm + "|" + x + by + x + by + x + ")"
m = "{}|{}|{}".format(xyz_cm, xy_cm, x_cm)
See the Python demo, which prints
['5.3 x 2.5', '11 x 11', '13 x 12 x 14', '13x12cm']
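To make the two issues above concrete, here is a minimal sketch. The patterns are simplified stand-ins made up for illustration (they are not the asker's full regex), and the names pair and number are hypothetical:

```python
import re

# Issue 1: capturing groups change what findall returns.
# This pattern is a simplified stand-in, not the asker's full regex.
pair = re.compile(r"(\d+(?:\.\d+)?) x (\d+(?:\.\d+)?)")
groups = pair.findall("5.3 x 2.5 cm")
whole = [match.group(0) for match in pair.finditer("5.3 x 2.5 cm")]
print(groups)  # [('5.3', '2.5')]
print(whole)   # ['5.3 x 2.5']

# Issue 2: an ungrouped alternation absorbs whatever is concatenated to it.
number = r"\d+|\.\d+"
ungrouped = re.findall(number + "cm", "12cm")         # pattern: \d+|\.\d+cm
grouped = re.findall("(?:" + number + ")cm", "12cm")  # pattern: (?:\d+|\.\d+)cm
print(ungrouped)  # ['12']
print(grouped)    # ['12cm']
```

In the ungrouped case the unit only attaches to the last alternative, which is why the original pattern matched bare numbers such as 13 and 12.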
To further enhance it, think of all the possibilities for x, by, and cm, and perhaps use str.format instead of concatenation.
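As a rough sketch of the str.format suggestion, the same building blocks could be combined with named fields. This assumes the sample text from the question and keeps the patterns from the quick fix, with only the assembly style changed:

```python
import re

x = r"(?:\.\d{1,2}|\d{1,4}\.?\d{0,2}|\d{5}\.?\d?|\d{6}\.?)"
by = r" ?(?:by|x) ?"
cm = r"(?:mm|cm|millimeter|centimeter|millimeters|centimeters)"

# Named fields keep each alternative readable; the templates themselves
# contain no literal braces, so no escaping is needed here.
x_cm = r"(?:{x} *(?:to|-) *{cm}|{x}{cm})".format(x=x, cm=cm)
xy_cm = r"(?:{x}{cm}{by}{x}{cm}|{x}{by}{x}{cm}|{x}{cm}{by}{x}|{x}{by}{x})".format(
    x=x, by=by, cm=cm
)
xyz_cm = r"(?:{x}{cm}{by}{x}{cm}{by}{x}{cm}|{x}{by}{x}{by}{x}{cm}|{x}{by}{x}{by}{x})".format(
    x=x, by=by, cm=cm
)
m = "{}|{}|{}".format(xyz_cm, xy_cm, x_cm)

# Sample text taken from the question.
text = "5.3 x 2.5 cm\n11 x 11 mm\n7 mm\n13 x 12 x 14 mm\n13x12cm"
result = re.findall(m, text)
print(result)  # ['5.3 x 2.5', '11 x 11', '13 x 12 x 14', '13x12cm']
```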
Upvotes: 5
Reputation: 4758
With regex, you should always build up your expression slowly to get what you want. E.g.
s = "5.3 x 2.5 cm"
You want to find the numbers here?
re.findall(r"\d+", s)
gives you all the integers:
["5", "3", "2", "5"]
OK, so what if your numbers can be floating point but don't have to be? Then you extend your expression with an optional non-capturing group that matches a dot, possibly followed by more digits.
re.findall(r"\d+(?:\.\d*)?", s)
this gives you
["5.3", "2.5"]
Then you can add the multiplication sign, with an arbitrary amount of whitespace around it:
re.findall(r"(\d+(?:\.\d*)?)\s*x\s*(\d+(?:\.\d*)?)", s)
Putting the numbers in capturing groups now gives you a tuple.
[("5.3", "2.5")]
You can then go on with the units:
re.findall(r"(\d+(?:\.\d*)?)\s*x\s*(\d+(?:\.\d*)?)\s*(cm|mm)", s)
giving you the tuple you want:
[("5.3", "2.5", "cm")]
and so on.
If you build your regexes like this, you can see what breaks from one change to the next. Debugging a huge regex like the one you posted above is not worth attempting.
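One way to keep that step-by-step approach honest is to assert the expected result after each extension. This is only a sketch; the names number, pair, unit and sized are made up for illustration:

```python
import re

s = "5.3 x 2.5 cm"

# Step 1: a number, optionally with a fractional part.
number = r"\d+(?:\.\d*)?"
assert re.findall(number, s) == ["5.3", "2.5"]

# Step 2: two captured numbers separated by "x" and optional whitespace.
pair = r"({n})\s*x\s*({n})".format(n=number)
assert re.findall(pair, s) == [("5.3", "2.5")]

# Step 3: append the captured unit.
unit = r"(cm|mm)"
sized = pair + r"\s*" + unit
assert re.findall(sized, s) == [("5.3", "2.5", "cm")]
```

If a later change breaks an earlier step, the failing assert points at the exact extension that caused it.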
I wouldn't name my unit regex cm; that's quite confusing for anyone maintaining your code in the future. Apart from that, you need clear requirements on the number formats you want to allow. Maybe somebody will enter scientific notation, etc., and your regexes will become very complicated.
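As an illustration of both points, here is a sketch with descriptive names and a number pattern that also accepts an exponent. The names NUMBER, LENGTH_UNIT, SEPARATOR and DIMENSIONS are hypothetical. The accepted formats (two or three dimensions, a trailing mm or cm, an optional exponent) are assumptions, not requirements stated in the question:

```python
import re

# Assumed number formats: 12, 12., 12.5, .5, each with an optional exponent.
NUMBER = r"(?:\d+(?:\.\d*)?|\.\d+)(?:[eE][+-]?\d+)?"
LENGTH_UNIT = r"(?:mm|cm)"
SEPARATOR = r"\s*(?:x|by)\s*"

# Two or three numbers joined by a separator, followed by one unit.
# The doubled braces survive str.format as the literal quantifier {1,2}.
DIMENSIONS = re.compile(
    r"{num}(?:{sep}{num}){{1,2}}\s*{unit}".format(
        num=NUMBER, sep=SEPARATOR, unit=LENGTH_UNIT
    )
)

# Sample text taken from the question.
text = "5.3 x 2.5 cm\n11 x 11 mm\n7 mm\n13 x 12 x 14 mm\n13x12cm"
found = DIMENSIONS.findall(text)
print(found)  # ['5.3 x 2.5 cm', '11 x 11 mm', '13 x 12 x 14 mm', '13x12cm']

scientific = DIMENSIONS.findall("1.5e1 by 2 mm")
print(scientific)  # ['1.5e1 by 2 mm']
```

Note that this sketch deliberately skips single measurements such as 7 mm; whether those should match is exactly the kind of requirement that needs to be settled first.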
Upvotes: 7