user1979556

regex to get measurements

I have these measurements in a document:

5.3 x 2.5 cm
11 x 11 mm
7 mm 
13 x 12 x 14 mm
13x12cm

I need to extract measurements such as 5.3 x 2.5 cm with a regex in Python.

My code so far is below, but it does not work properly:

x = "\.\d{1,2}|\d{1,4}\.?\d{0,2}|\d{5}\.?\d?|\d{6}\.?"
by = "( )?(by|x)( )?"
cm = "(mm|cm|millimeter|centimeter|millimeters|centimeters)"
x_cm = "((" + x + " *(to|\-) *" + cm + ")" + "|(" + x + cm + "))"
xy_cm = "((" + x + cm + by + x + cm + ")" +"|(" + x + by + x + cm + ")" +"|(" + x + by + x + "))"
xyz_cm = "((" + x + cm + by + x + cm + by + x + cm + ")" + "|(" + x + by + x + by + x + cm + ")" + "|(" + x + by + x + by + x + "))"
m = "((" + xyz_cm + ")" + "|(" + xy_cm + ")" + "|(" + x_cm + "))"
a = re.compile(m)
print(a.findall(text))

The output it gives:

[('13', '13', '13', '13', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', ''), ('12', '12', '12', '12', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', ''), ('4', '4', '4', '4', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', ''), ('25', '25', '25', '25', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', ''),

Upvotes: 7

Views: 3721

Answers (2)

Wiktor Stribiżew

Reputation: 627126

The current regex has only two issues:

  • You need to get rid of the capturing groups, since .findall returns the captured substrings rather than the whole match when the pattern contains groups. This is not crucial, though: you could instead use re.finditer and read match.group(0).
  • The main issue is that you did not group the x pattern, so the alternation between number formats broke the structure of the final pattern.
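Both points can be reproduced on a small scale. The sketch below uses deliberately simplified patterns (the num variable is an illustrative stand-in, not the x pattern from the question) to show what findall returns with and without capturing groups, and how an ungrouped alternation splits a concatenated pattern.

```python
import re

text = "13x12cm"

# With capturing groups, findall returns the groups, not the whole match.
with_groups = re.findall(r"(\d+)(x)(\d+)(cm|mm)", text)

# With only non-capturing groups, findall returns the whole match.
without_groups = re.findall(r"\d+x\d+(?:cm|mm)", text)

# finditer sidesteps the issue: group(0) is always the whole match.
whole_matches = [m.group(0) for m in re.finditer(r"(\d+)(x)(\d+)(cm|mm)", text)]

# Alternation binds more loosely than concatenation, so an ungrouped
# number pattern turns "number then unit" into two unrelated branches.
num = r"\.\d+|\d+"  # simplified, ungrouped number pattern (illustrative)
ungrouped = re.findall(num + "cm", ".5cm")         # reads as \.\d+ OR \d+cm
grouped = re.findall("(?:" + num + ")cm", ".5cm")  # reads as (number) then cm

print(with_groups, without_groups, whole_matches, ungrouped, grouped)
```

In the ungrouped case the unit is silently dropped, because the first branch matches the number on its own.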

A quick fix will look like

x = r"(?:\.\d{1,2}|\d{1,4}\.?\d{0,2}|\d{5}\.?\d?|\d{6}\.?)"
by = r"(?: )?(?:by|x)(?: )?"
cm = r"(?:mm|cm|millimeter|centimeter|millimeters|centimeters)"
x_cm = "(?:" + x + r" *(?:to|\-) *" + cm + "|" + x + cm + ")"
xy_cm = "(?:" + x + cm + by + x + cm +"|" + x + by + x + cm +"|" + x + cm + by + x +"|" + x + by + x + ")"
xyz_cm = "(?:" + x + cm + by + x + cm + by + x + cm + "|" + x + by + x + by + x + cm + "|" + x + by + x + by + x + ")"
m = "{}|{}|{}".format(xyz_cm, xy_cm, x_cm)

See the Python demo, which prints:

['5.3 x 2.5', '11 x 11', '13 x 12 x 14', '13x12cm']

To enhance it further, consider all the forms that x, by, and cm can take, and perhaps use str.format instead of concatenation.
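One possible way to follow that suggestion is sketched below. The names NUM, SEP, UNIT, DIM and MEASUREMENT are illustrative, the number pattern is a simplified stand-in for the original x, and allowing an optional unit (with optional whitespace) after every number is an assumption about what should count as a measurement.

```python
import re

NUM = r"(?:\d+(?:\.\d+)?|\.\d+)"   # 5, 5.3 or .5 (simplified assumption)
SEP = r"\s*(?:by|x)\s*"            # "x" or "by" with optional whitespace
UNIT = r"(?:millimeters?|centimeters?|mm|cm)"

# A single dimension: a number, optionally followed by a unit.
DIM = r"{num}(?:\s*{unit})?".format(num=NUM, unit=UNIT)

# One to three dimensions joined by the separator.
# The doubled braces produce the literal quantifier {0,2}.
MEASUREMENT = re.compile(r"{dim}(?:{sep}{dim}){{0,2}}".format(dim=DIM, sep=SEP))

text = """5.3 x 2.5 cm
11 x 11 mm
7 mm
13 x 12 x 14 mm
13x12cm"""

# No capturing groups, so findall returns whole matches.
matches = MEASUREMENT.findall(text)
print(matches)
```

Because the unit is optional everywhere, this sketch also matches a bare number; make the trailing unit mandatory if that is not wanted.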

Upvotes: 5

CodeMonkey

Reputation: 4758

With regexes, you should always build up your expression step by step until it matches what you want. For example:

s = "5.3 x 2.5 cm"

You want to find the numbers here?

re.findall(r"\d+", s)

gives you all the integers:

["5", "3", "2", "5"]

OK, so what if your numbers can be floating point but don't have to be? Then you extend the expression with an optional non-capturing group that matches a dot followed by zero or more digits.

re.findall(r"\d+(?:\.\d*)?", s)

This gives you

["5.3", "2.5"]

Then you can match the "x" separator with any amount of whitespace around it:

re.findall(r"(\d+(?:\.\d*)?)\s*x\s*(\d+(?:\.\d*)?)", s)

Putting the numbers in capturing groups now gives you a tuple.

[("5.3", "2.5")]

You can then go on with the units:

re.findall(r"(\d+(?:\.\d*)?)\s*x\s*(\d+(?:\.\d*)?)\s*(cm|mm)", s)

giving you the tuple you want:

[("5.3", "2.5", "cm")]

and so on.

If you build your regexes like this, you can see what breaks from one change to the next. Debugging a huge regex like the one you posted above is not worth attempting.
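One way to make each step checkable is to keep an assertion per step, so a later change that breaks an earlier step fails immediately. This is only a sketch: the variable names are illustrative, and step 5 (an optional third dimension) is an extension that goes beyond the steps shown above.

```python
import re

s = "5.3 x 2.5 cm"

# Step 1: integers only.
assert re.findall(r"\d+", s) == ["5", "3", "2", "5"]

# Step 2: allow an optional fractional part.
number = r"\d+(?:\.\d*)?"
assert re.findall(number, s) == ["5.3", "2.5"]

# Step 3: two captured numbers separated by "x".
pair = r"({n})\s*x\s*({n})".format(n=number)
assert re.findall(pair, s) == [("5.3", "2.5")]

# Step 4: add the unit.
with_unit = pair + r"\s*(cm|mm)"
assert re.findall(with_unit, s) == [("5.3", "2.5", "cm")]

# Step 5 (extension, not part of the steps above): optional third dimension.
triple = r"({n})\s*x\s*({n})(?:\s*x\s*({n}))?\s*(cm|mm)".format(n=number)
result = re.findall(triple, "13 x 12 x 14 mm")
print(result)
```

When the third dimension is absent, its group comes back as an empty string in the tuple.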

I wouldn't name the unit regex cm; that is quite confusing for anyone maintaining your code in the future. Apart from that, you need clear requirements on the number formats you want to allow. Somebody might enter scientific notation, for example, and your regexes will then become very complicated.
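As a rough illustration of both remarks, the sketch below uses descriptive names and a number pattern that also accepts an exponent. NUMBER, UNIT and the group names value and unit are hypothetical choices, and the set of accepted number formats is an assumption, not a requirement taken from the question.

```python
import re

# Verbose-mode pattern: whitespace and "#" comments inside it are ignored.
NUMBER = r"""
    [+-]?                  # optional sign
    (?:\d+\.?\d*|\.\d+)    # 5, 5., 5.3 or .5
    (?:[eE][+-]?\d+)?      # optional exponent, e.g. 1.2e-3
"""
UNIT = r"(?:mm|cm)"  # named after what it matches, not after one unit

pattern = re.compile(
    r"(?P<value>{number})\s*(?P<unit>{unit})".format(number=NUMBER, unit=UNIT),
    re.VERBOSE,
)

m = pattern.search("thickness 1.2e-3 cm")
value, unit = float(m.group("value")), m.group("unit")
print(value, unit)
```

Even this small pattern shows how quickly the number format grows once more notations are allowed, which is why the requirements should be fixed first.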

Upvotes: 7
