Reputation: 87
I'm trying to extract dimensions and units from text.
The data could look like any of the following:
53 inch x 45 inch
10 in by 5 in
53" W x 74" L x 15" H
53 inch W x 74 inch L x 15 inch H
There are posts that cover the first two cases, but I could not work out how to handle cases 3 and 4.
This is what I tried for the basic cases, adapted from one of those posts, but it doesn't work:
import re
regex = r"(?<!\S)\d+(?:,\d+)?\s*(?:inch|in| in|\")* ?x ?\d+(?:,\d+)?(?: ?x ?\d+(?:,\d+)?)*\s*(?:inch| inch|in| in|\")*"
test_str = ("15 mm x 2 mm x 3")
result = re.findall(regex, test_str)
print(result)
Also, I only want to extract these dimension strings, because I am already using Quantulum to extract other numeric values and it fails in this case. Any guidance on how to make the two work together would be very much appreciated.
Thank you for your help.
Upvotes: 5
Views: 1107
Reputation: 627126
You can use
(?<!\S)(\d+(?:,\d+)?) *(?:(?:in(?:ch)?|")(?: +W)?)? ?(?:x|by) ?(\d+(?:,\d+)?)(?: ?x ?\d+(?:,\d+)?)* *(?:(?:in(?:ch)?|")(?: +L)?)?(?: ?x ?(\d+(?:,\d+)?))* *(?:(?:in(?:ch)?|")(?: +H)?)?
See the regex demo.
Certainly, \s is better than literal spaces in the pattern, as it matches any whitespace:
(?<!\S)(\d+(?:,\d+)?)\s*(?:(?:in(?:ch)?|")(?:\s+W)?)?\s?(?:x|by)\s?(\d+(?:,\d+)?)(?:\s?x\s?\d+(?:,\d+)?)*\s*(?:(?:in(?:ch)?|")(?:\s+L)?)?(?:\s?x\s?(\d+(?:,\d+)?))*\s*(?:(?:in(?:ch)?|")(?:\s+H)?)?
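As a rough illustration, the \s variant of the pattern can be run in Python against the four sample strings from the question. This is only a sketch: the names DIMENSIONS, SAMPLES and extract_dimensions are made up for the example, and the pattern is split across several raw-string literals purely for readability.

```python
import re

# The \s-based pattern from above, split over three raw strings for readability.
DIMENSIONS = re.compile(
    r'(?<!\S)(\d+(?:,\d+)?)\s*(?:(?:in(?:ch)?|")(?:\s+W)?)?\s?(?:x|by)\s?'
    r'(\d+(?:,\d+)?)(?:\s?x\s?\d+(?:,\d+)?)*\s*(?:(?:in(?:ch)?|")(?:\s+L)?)?'
    r'(?:\s?x\s?(\d+(?:,\d+)?))*\s*(?:(?:in(?:ch)?|")(?:\s+H)?)?'
)

# The four sample inputs listed in the question.
SAMPLES = [
    "53 inch x 45 inch",
    "10 in by 5 in",
    '53" W x 74" L x 15" H',
    "53 inch W x 74 inch L x 15 inch H",
]


def extract_dimensions(text):
    """Return one tuple of captured numbers per match, skipping unused groups."""
    return [
        tuple(g for g in m.groups() if g is not None)
        for m in DIMENSIONS.finditer(text)
    ]


for sample in SAMPLES:
    print(sample, "->", extract_dimensions(sample))
```

Note that only in, inch and " are recognised as units, so the test string from the question ("15 mm x 2 mm x 3") yields no match unless mm is added to the unit alternation.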
Details:
(?<!\S) - a left-hand whitespace boundary
(\d+(?:,\d+)?) - Group 1: an int or float numeric value
 * - zero or more spaces
(?:(?:in(?:ch)?|")(?: +W)?)? - an optional sequence of in, inch or " that is optionally followed by one or more spaces and W
 ? - an optional space
(?:x|by) - x or by
 ? - an optional space
(\d+(?:,\d+)?)(?: ?x ?\d+(?:,\d+)?)* *(?:(?:in(?:ch)?|")(?: +L)?)?(?: ?x ?(\d+(?:,\d+)?))* *(?:(?:in(?:ch)?|")(?: +H)?)? - two more optional repetitions of similar pattern sequences as described above (L and H are used instead of W); the numeric values are captured into Groups 2 and 3.
Upvotes: 3