Reputation: 2956
I'll put it right out there: I'm terrible with regular expressions. I've tried to come up with one to solve my problem but I really don't know much about them. . .
Imagine some sentences along the following lines:
- Hello blah blah. It's around 11 1/2" x 32".
- The dimensions are 8 x 10-3/5!
- Probably somewhere in the region of 22" x 17".
- The roll is quite large: 42 1/2" x 60 yd.
- They are all 5.76 by 8 frames.
- Yeah, maybe it's around 84cm long.
- I think about 13/19".
- No, it's probably 86 cm actually.
I want to, as cleanly as possible, extract item dimensions from within these sentences. In a perfect world the regular expression would output the following:
- 11 1/2" x 32"
- 8 x 10-3/5
- 22" x 17"
- 42 1/2" x 60 yd
- 5.76 by 8
- 84cm
- 13/19"
- 86 cm
I imagine a world where the following rules apply:
- Valid units are {cm, mm, yd, yards, ", ', feet}, though I'd prefer a solution that considers an arbitrary set of units rather than an explicit solution for the above units.
- A dimension may be a fraction, e.g., 4/5".
- A fraction has a / separating the numerator / denominator, and one can assume there is no space between the parts (though if someone takes that into account that's great!).
- Two dimensions are separated by one of {x, by}. If a dimension is only one-dimensional it must have units from the set above, i.e., 22 cm is OK, .333 is not, nor is 4.33 oz.
To show you how useless I am with regular expressions (and to show I at least tried!), I got this far:
[1-9]+[/ ][x1-9]
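To see why this attempt falls short, here is a minimal Python sketch (Python is an assumption, since the question names no language; `first_match` is a hypothetical helper) that runs the pattern unchanged against a few of the sample sentences. `[1-9]+` excludes the digit 0, `[/ ]` accepts exactly one slash or space, and `[x1-9]` consumes a single character, so at best the pattern grabs the first few characters of a dimension and never reaches a unit.

```python
import re

# The pattern from the question, unchanged.
ATTEMPT = re.compile(r"[1-9]+[/ ][x1-9]")


def first_match(pattern, text):
    """Return the first matched substring, or None if nothing matches."""
    m = pattern.search(text)
    return m.group(0) if m else None


samples = [
    "It's around 11 1/2\" x 32\".",
    "The dimensions are 8 x 10-3/5!",
    "Probably somewhere in the region of 22\" x 17\".",
    "Yeah, maybe it's around 84cm long.",
]

for s in samples:
    print(repr(first_match(ATTEMPT, s)), "<-", s)
# '11 1' <- It's around 11 1/2" x 32".
# '8 x'  <- The dimensions are 8 x 10-3/5!
# None   <- Probably somewhere in the region of 22" x 17".
# None   <- Yeah, maybe it's around 84cm long.
```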
Update (2)
You guys are very fast and efficient! I'm going to add a few extra test cases that haven't been covered by the regular expressions below:
- The last but one test case is 12 yd x.
- The last test case is 99 cm by.
- This sentence doesn't have dimensions in it: 342 / 5553 / 222.
- Three dimensions? 22" x 17" x 12 cm
- This is a product code: c720 with another number 83 x better.
- A number on its own 21.
- A volume shouldn't match 0.332 oz.
These should result in the following (# indicates nothing should match):
- 12 yd
- 99 cm
- #
- 22" x 17" x 12 cm
- #
- #
- #
I've adapted M42's answer below, to:
\d+(?:\.\d+)?[\s-]*(?:\d+)?(?:\/\d+)?(?:cm|mm|yd|"|'|feet)(?:\s*x\s*|\s*by\s*)?(?:\d+(?:\.\d+)?[\s*-]*(?:\d+(?:\/\d+)?)?(?:cm|mm|yd|"|'|feet)?)?
But while that resolves some of the new test cases, it now fails some of the others. It reports:
- 11 1/2" x 32" PASS
- (nothing) FAIL
- 22" x 17" PASS
- 42 1/2" x 60 yd PASS
- (nothing) FAIL
- 84cm PASS
- 13/19" PASS
- 86 cm PASS
- 22" PASS
- (nothing) FAIL
- (nothing) FAIL
- 12 yd x FAIL
- 99 cm by FAIL
- 22" x 17" [and also, but separately '12 cm'] FAIL
- (nothing) PASS
- (nothing) PASS
Upvotes: 7
Views: 2790
Reputation: 91508
New version, close to the target, with 2 failing tests:
#!/usr/local/bin/perl
use Modern::Perl;
use Test::More;
my $re1 = qr/\d+(?:\.\d+)?[\s-]*(?:\d+)?(?:\/\d+)?(?:cm|mm|yd|"|'|feet)/;
my $re2 = qr/(?:\s*x\s*|\s*by\s*)/;
my $re3 = qr/\d+(?:\.\d+)?[\s-]*(?:\d+)?(?:\/\d+)?(?:cm|mm|yd|"|'|feet|frames)/;
my @out = (
'11 1/2" x 32"',
'8 x 10-3/5',
'22" x 17"',
'42 1/2" x 60 yd',
'5.76 by 8 frames',
'84cm',
'13/19"',
'86 cm',
'12 yd',
'99 cm',
'no match',
'22" x 17" x 12 cm',
'no match',
'no match',
'no match',
);
my $i = 0;
while (<DATA>) {
    chomp;
    # One dimension with a unit, optionally followed by up to two more "x"/"by" parts.
    if (/($re1(?:$re2$re3)?(?:$re2$re1)?)/) {
        ok($1 eq $out[$i], $1 . ' in ' . $_);
    } else {
        ok($out[$i] eq 'no match', ' got "no match" in ' . $_);
    }
    $i++;
}
done_testing;
__DATA__
Hello blah blah. It's around 11 1/2" x 32".
The dimensions are 8 x 10-3/5!
Probably somewhere in the region of 22" x 17".
The roll is quite large: 42 1/2" x 60 yd.
They are all 5.76 by 8 frames.
Yeah, maybe it's around 84cm long.
I think about 13/19".
No, it's probably 86 cm actually.
The last but one test case is 12 yd x.
The last test case is 99 cm by.
This sentence doesn't have dimensions in it: 342 / 5553 / 222.
Three dimensions? 22" x 17" x 12 cm
This is a product code: c720 with another number 83 x better.
A number on its own 21.
A volume shouldn't match 0.332 oz.
output:
# Failed test ' got "no match" in The dimensions are 8 x 10-3/5!'
# at C:\tests\perl\test6.pl line 42.
# Failed test ' got "no match" in They are all 5.76 by 8 frames.'
# at C:\tests\perl\test6.pl line 42.
# Looks like you failed 2 tests of 15.
ok 1 - 11 1/2" x 32" in Hello blah blah. It's around 11 1/2" x 32".
not ok 2 - got "no match" in The dimensions are 8 x 10-3/5!
ok 3 - 22" x 17" in Probably somewhere in the region of 22" x 17".
ok 4 - 42 1/2" x 60 yd in The roll is quite large: 42 1/2" x 60 yd.
not ok 5 - got "no match" in They are all 5.76 by 8 frames.
ok 6 - 84cm in Yeah, maybe it's around 84cm long.
ok 7 - 13/19" in I think about 13/19".
ok 8 - 86 cm in No, it's probably 86 cm actually.
ok 9 - 12 yd in The last but one test case is 12 yd x.
ok 10 - 99 cm in The last test case is 99 cm by.
ok 11 - got "no match" in This sentence doesn't have dimensions in it: 342 / 5553 / 222.
ok 12 - 22" x 17" x 12 cm in Three dimensions? 22" x 17" x 12 cm
ok 13 - got "no match" in This is a product code: c720 with another number 83 x better.
ok 14 - got "no match" in A number on its own 21.
ok 15 - got "no match" in A volume shouldn't match 0.332 oz.
1..15
It seems difficult to match 5.76 by 8 frames but not 0.332 oz: sometimes you have to match a number with a unit and sometimes a number without one.
I'm sorry, I'm not able to do better.
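One way to reconcile "numbers with a unit" and "numbers without a unit" is to split the pattern into two branches: a multi-part branch where units are optional but a separator is mandatory, and a single-part branch where the unit is mandatory. The following is a minimal Python sketch of that idea, not a port of the Perl above; `build_dimension_regex`, `extract` and `CASES` are hypothetical names, and it targets the question's expected output (5.76 by 8, without "frames").

```python
import re

# Unit set from the question; any other list of literal units can be passed in.
UNITS = ["cm", "mm", "yd", "yards", '"', "'", "feet"]


def build_dimension_regex(units):
    """Compile a pattern for 'N unit' or 'N [unit] x N [unit] ...'."""
    # Escape each unit; the lookahead stops 'cm' matching inside a longer word.
    unit = "(?:" + "|".join(re.escape(u) for u in units) + ")(?![A-Za-z])"
    # A bare fraction (13/19), or an integer/decimal with an optional
    # fraction attached by a space or hyphen (11 1/2, 10-3/5, 5.76).
    num = r"(?:\d+/\d+|\d+(?:\.\d+)?(?:[ -]\d+/\d+)?)"
    sep = r"\s*(?:x|by)\s*"
    with_unit = rf"{num}\s*{unit}"
    opt_unit = rf"{num}(?:\s*{unit})?"
    # Branch 1: two or more parts joined by a separator, units optional.
    # Branch 2: a single part, unit required.
    # The lookbehind avoids starting inside a word, a decimal or a fraction.
    return re.compile(rf"(?<![\w./])(?:{opt_unit}(?:{sep}{opt_unit})+|{with_unit})")


DIMENSION = build_dimension_regex(UNITS)


def extract(text):
    """Return the first dimension found in text, or None."""
    m = DIMENSION.search(text)
    return m.group(0) if m else None


# (sentence, expected) pairs taken from the question; None means "no match".
CASES = [
    ("Hello blah blah. It's around 11 1/2\" x 32\".", '11 1/2" x 32"'),
    ("The dimensions are 8 x 10-3/5!", "8 x 10-3/5"),
    ("Probably somewhere in the region of 22\" x 17\".", '22" x 17"'),
    ("The roll is quite large: 42 1/2\" x 60 yd.", '42 1/2" x 60 yd'),
    ("They are all 5.76 by 8 frames.", "5.76 by 8"),
    ("Yeah, maybe it's around 84cm long.", "84cm"),
    ("I think about 13/19\".", '13/19"'),
    ("No, it's probably 86 cm actually.", "86 cm"),
    ("The last but one test case is 12 yd x.", "12 yd"),
    ("The last test case is 99 cm by.", "99 cm"),
    ("This sentence doesn't have dimensions in it: 342 / 5553 / 222.", None),
    ("Three dimensions? 22\" x 17\" x 12 cm", '22" x 17" x 12 cm'),
    ("This is a product code: c720 with another number 83 x better.", None),
    ("A number on its own 21.", None),
    ("A volume shouldn't match 0.332 oz.", None),
]

for sentence, expected in CASES:
    status = "ok" if extract(sentence) == expected else "not ok"
    print(status, "-", repr(extract(sentence)), "in", sentence)
```

The trade-off is that the multi-part branch accepts any two numbers joined by x or by, so a phrase such as "3 by 4 people" would also be reported as a dimension.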
Upvotes: 5
Reputation: 36282
This is all I could get with a regular expression in Perl. Try to adapt it to your regex flavour:
\d.*\d(?:\s+\S+|\S+)
Explanation:
\d # One digit.
.* # Any number of characters (greedy).
\d # One digit. Together these match everything between the first and the last digit.
\s+\S+ # Some whitespace followed by non-space characters. It tries to match a separated unit like 'cm' or 'yd'.
| # Or. Selects one of the two expressions between the parentheses.
\S+ # One or more non-space characters. It tries to match double quotes, or units joined to the
# last number.
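Because `.*` is greedy and nothing in the pattern checks for a unit, it is worth seeing how it behaves on sentences that should not match. The following Python sketch (Python is an assumption; the pattern itself is copied unchanged, and `grab` is a hypothetical helper) runs it against some of the question's later test cases.

```python
import re

# Same pattern as above, compiled with Python's re module.
GREEDY = re.compile(r"\d.*\d(?:\s+\S+|\S+)")


def grab(text):
    """Return the first match of GREEDY in text, or None."""
    m = GREEDY.search(text)
    return m.group(0) if m else None


for text in [
    "Yeah, maybe it's around 84cm long.",
    "This sentence doesn't have dimensions in it: 342 / 5553 / 222.",
    "This is a product code: c720 with another number 83 x better.",
    "A number on its own 21.",
]:
    print(repr(grab(text)))
# '84cm'
# '342 / 5553 / 222.'
# '720 with another number 83 x'
# '21.'
```

Only the first line is a wanted result; the other three show that the pattern reports any run of text between two digits, so it needs an explicit unit check to reject the negative cases.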
My test:
Content of script.pl:
use warnings;
use strict;
while ( <DATA> ) {
    print qq[$1\n] if m/(\d.*\d(\s+\S+|\S+))/;
}
__DATA__
Hello blah blah. It's around 11 1/2" x 32".
The dimensions are 8 x 10-3/5!
Probably somewhere in the region of 22" x 17".
The roll is quite large: 42 1/2" x 60 yd.
They are all 5.76 by 8 frames.
Yeah, maybe it's around 84cm long.
I think about 13/19".
No, it's probably 86 cm actually.
Running the script:
perl script.pl
Result:
11 1/2" x 32".
8 x 10-3/5!
22" x 17".
42 1/2" x 60 yd.
5.76 by 8 frames.
84cm
13/19".
86 cm
Upvotes: 2
Reputation: 26940
One of many possible solutions (should be nlp compatible as it uses only basic regex syntax):
foundMatch = Regex.IsMatch(SubjectString, @"\d+(?: |cm|\.|""|/)[\d/""x -]*(?:\b(?:by\s*\d+|cm|yd)\b)?");
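This reports whether the input contains a dimension; to get the matched text itself, call Regex.Match with the same pattern and read its Value property.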
Explanation:
"
\d # Match a single digit 0..9
+ # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
(?: # Match the regular expression below
# Match either the regular expression below (attempting the next alternative only if this one fails)
\ # Match the character “ ” literally
| # Or match regular expression number 2 below (attempting the next alternative only if this one fails)
cm # Match the characters “cm” literally
| # Or match regular expression number 3 below (attempting the next alternative only if this one fails)
\. # Match the character “.” literally
| # Or match regular expression number 4 below (attempting the next alternative only if this one fails)
"" # Match the character “"” literally (written as "" inside a C# verbatim string)
| # Or match regular expression number 5 below (the entire group fails if this one fails to match)
/ # Match the character “/” literally
)
[\d/""x -] # Match a single character present in the list below
# A single digit 0..9
# One of the characters “/”, “"” (written as "" in the verbatim string) or “x”
# The character “ ”
# The character “-”
* # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
(?: # Match the regular expression below
\b # Assert position at a word boundary
(?: # Match the regular expression below
# Match either the regular expression below (attempting the next alternative only if this one fails)
by # Match the characters “by” literally
\s # Match a single character that is a “whitespace character” (spaces, tabs, line breaks, etc.)
* # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
\d # Match a single digit 0..9
+ # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
| # Or match regular expression number 2 below (attempting the next alternative only if this one fails)
cm # Match the characters “cm” literally
| # Or match regular expression number 3 below (the entire group fails if this one fails to match)
yd # Match the characters “yd” literally
)
\b # Assert position at a word boundary
)? # Between zero and one times, as many times as possible, giving back as needed (greedy)
"
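To inspect what the pattern actually captures rather than a yes/no answer, here is a hedged Python port (an assumption: the original is C#, where `""` inside the verbatim string stands for a single `"`; `extract` is a hypothetical helper). It uses `re.search` and returns the matched text.

```python
import re

# Port of the C# pattern; in a Python raw string the quote is written once.
PORT = re.compile(r'\d+(?: |cm|\.|"|/)[\d/"x -]*(?:\b(?:by\s*\d+|cm|yd)\b)?')


def extract(text):
    """Return the first substring matched by PORT, or None."""
    m = PORT.search(text)
    return m.group(0) if m else None


for text in [
    "I think about 13/19\".",
    "They are all 5.76 by 8 frames.",
    "No, it's probably 86 cm actually.",
    "A number on its own 21.",
    "A volume shouldn't match 0.332 oz.",
]:
    print(repr(extract(text)))
# '13/19"'
# '5.76 by 8'
# '86 cm'
# '21.'
# '0.332 '
```

The first three results agree with the question's expected output. The last two show that a number followed by a dot or a space is accepted even without a unit, so the later negative test cases would need an additional check.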
Upvotes: 2