VIPUL VAIBHAV

Reputation: 87

Regular Expression to extract quantity with dimensions from text in Python

I'm trying to extract dimensions and units from text.

The data could look like any of the following:

53 inch x 45 inch

10 in by 5 in

53" W x 74" L x 15" H

53 inch W x 74 inch L x 15 inch H

There are posts that cover the first two cases, but I could not work out how to handle cases 3 and 4.

This is what I tried for the basic cases (adapted from a linked post), but it doesn't work:

import re
regex = r"(?<!\S)\d+(?:,\d+)?\s*(?:inch|in| in|\")* ?x ?\d+(?:,\d+)?(?: ?x ?\d+(?:,\d+)?)*\s*(?:inch| inch|in| in|\")*"
test_str = ("15 mm x 2 mm x 3")
result = re.findall(regex, test_str)  
print(result)

Also, I only want the regex to extract these dimension strings: I already use Quantulum to extract other numeric values, but it fails on this kind of input. Any guidance on how to make the two work together would be very much appreciated.

Thank you for your help.

Upvotes: 5

Views: 1107

Answers (1)

Wiktor Stribiżew

Reputation: 627126

You can use

(?<!\S)(\d+(?:,\d+)?) *(?:(?:in(?:ch)?|")(?: +W)?)? ?(?:x|by) ?(\d+(?:,\d+)?)(?: ?x ?\d+(?:,\d+)?)* *(?:(?:in(?:ch)?|")(?: +L)?)?(?: ?x ?(\d+(?:,\d+)?))* *(?:(?:in(?:ch)?|")(?: +H)?)?

See the regex demo.

Using \s instead of literal spaces is more robust, since it matches any whitespace character:

(?<!\S)(\d+(?:,\d+)?)\s*(?:(?:in(?:ch)?|")(?:\s+W)?)?\s?(?:x|by)\s?(\d+(?:,\d+)?)(?:\s?x\s?\d+(?:,\d+)?)*\s*(?:(?:in(?:ch)?|")(?:\s+L)?)?(?:\s?x\s?(\d+(?:,\d+)?))*\s*(?:(?:in(?:ch)?|")(?:\s+H)?)?
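A minimal sketch of how the \s-based pattern could be used from Python on the four sample strings in the question. The names DIM_RE and extract_dimensions are made up for this illustration; the pattern itself is unchanged, only split across lines for readability.

```python
import re

# The \s-based pattern from above, split over several raw-string pieces.
# Single quotes are used so the literal " inside the pattern needs no escaping.
DIM_RE = re.compile(
    r'(?<!\S)(\d+(?:,\d+)?)\s*(?:(?:in(?:ch)?|")(?:\s+W)?)?'
    r'\s?(?:x|by)\s?(\d+(?:,\d+)?)(?:\s?x\s?\d+(?:,\d+)?)*'
    r'\s*(?:(?:in(?:ch)?|")(?:\s+L)?)?'
    r'(?:\s?x\s?(\d+(?:,\d+)?))*'
    r'\s*(?:(?:in(?:ch)?|")(?:\s+H)?)?'
)


def extract_dimensions(text):
    """Return one tuple of captured numbers per match, skipping unmatched groups."""
    return [
        tuple(g for g in m.groups() if g is not None)
        for m in DIM_RE.finditer(text)
    ]


samples = [
    '53 inch x 45 inch',
    '10 in by 5 in',
    '53" W x 74" L x 15" H',
    '53 inch W x 74 inch L x 15 inch H',
]

for s in samples:
    print(s, '->', extract_dimensions(s))
```

Two limits are worth knowing. The pattern only recognizes in, inch and " as units, so the question's test string "15 mm x 2 mm x 3" yields no match. For a unit-less input such as "10 x 5 x 15", the third number is consumed by the non-capturing repetition after Group 2, so only two values are captured.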

Details:

  • (?<!\S) - a left-hand whitespace boundary
  • (\d+(?:,\d+)?) - Group 1: a number, i.e. one or more digits optionally followed by a comma and more digits
  • ' *' (a space followed by *) - zero or more spaces
  • (?:(?:in(?:ch)?|")(?: +W)?)? - an optional sequence of in, inch or " that are optionally followed by one or more spaces and W
  • ' ?' (a space followed by ?) - an optional space
  • (?:x|by) - x or by
  • ' ?' (a space followed by ?) - an optional space
  • (\d+(?:,\d+)?)(?: ?x ?\d+(?:,\d+)?)* *(?:(?:in(?:ch)?|")(?: +L)?)?(?: ?x ?(\d+(?:,\d+)?))* *(?:(?:in(?:ch)?|")(?: +H)?)? - the same scheme applied to the second and third values, with L and H in place of W: the second number is required and captured into Group 2, while the optional third number is captured into Group 3.
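On the question of combining this with Quantulum: one possible approach, sketched here as an assumption rather than a tested integration, is to pull the dimension strings out with the regex first, blank them out of the text, and hand only the remainder to the other parser. The helper name split_dimensions and the sample sentence are hypothetical, and the Quantulum call itself is deliberately left out because its API is not covered in this thread.

```python
import re

# Same \s-based pattern as above (repeated so this snippet is self-contained).
DIM_RE = re.compile(
    r'(?<!\S)(\d+(?:,\d+)?)\s*(?:(?:in(?:ch)?|")(?:\s+W)?)?'
    r'\s?(?:x|by)\s?(\d+(?:,\d+)?)(?:\s?x\s?\d+(?:,\d+)?)*'
    r'\s*(?:(?:in(?:ch)?|")(?:\s+L)?)?'
    r'(?:\s?x\s?(\d+(?:,\d+)?))*'
    r'\s*(?:(?:in(?:ch)?|")(?:\s+H)?)?'
)


def split_dimensions(text):
    """Return (dims, remainder).

    dims is a list of dicts with the matched text and the captured numbers;
    remainder is the input with every match replaced by spaces, so character
    offsets in the remainder still line up with the original text.
    """
    dims = []
    pieces = []
    last = 0
    for m in DIM_RE.finditer(text):
        dims.append({
            'text': m.group(0).strip(),
            'values': [g for g in m.groups() if g is not None],
        })
        pieces.append(text[last:m.start()])
        pieces.append(' ' * (m.end() - m.start()))  # blank out the match
        last = m.end()
    pieces.append(text[last:])
    return dims, ''.join(pieces)


text = 'Table 53" W x 74" L x 15" H weighs 20 kg'  # hypothetical input
dims, remainder = split_dimensions(text)
print(dims)       # dimension matches handled by the regex
print(remainder)  # what would be passed on to the other parser
```

Replacing matches with spaces instead of deleting them keeps spans reported by the second parser comparable with positions in the original string.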

Upvotes: 3
