Reputation: 98
I need to build a regex that matches words with these patterns:
Letters and numbers:
A35, 35A, B503X, 1ABC5
Letters and numbers separated by "-", "/", "\":
AB-10, 10-AB, A10-BA, BA-A10, etc...
I wrote this regex for it:
\b[A-Za-z]+(?=[(?<!\-|\\|\/)\d]+)[(?<!\-|\\|\/)\w]+\b|\b[0-9]+(?=[(?<!\-|\\|\/)A-Za-z]+)[(?<!\-|\\|\/)\w]+\b
It works partially, but it also matches words made up of only letters or only numbers separated by symbols. Example:
10-10, open-office, etc.
And I don't want these to match.
I guess that my regex is very repetitive and somewhat ugly. But it's what I have for now.
Could anyone help me?
I'm using java/groovy.
Thanks in advance.
Upvotes: 2
Views: 7706
Reputation: 27585
Excuse me for writing my solution in Python; I don't know enough Java to write it in Java.
pat = re.compile('(?=(?:([A-Z])|[0-9])' ## This part verifies that
'[^ ]*' ## there is at least one
'(?(1)\d|[A-Z]))' ## letter and one digit.
'('
'(?:(?<=[ ,])[A-Z0-9]|\A[A-Z0-9])' # start of second group
'[A-Z0-9-/\\\\]*'
'[A-Z0-9](?= |\Z|,)' # end of second group
')',
re.IGNORECASE) # this group 2 catches the string
My solution captures the desired string in the second group: ((?:(?<=[ ,])[A-Z0-9]|\A[A-Z0-9])[A-Z0-9-/\\\\]*[A-Z0-9](?= |\Z|,))
The part before it verifies that at least one letter and at least one digit are present in the captured string:
(?(1)\d|[A-Z])
is a conditional regex that means "if group 1 captured something, there must be a digit here; otherwise there must be a letter".
Group 1 is ([A-Z])
in (?=(?:([A-Z])|[0-9])
(?:([A-Z])|[0-9])
is a non-capturing group that matches a letter (captured) OR a digit, so when it matches a letter, group 1 isn't empty.
The re.IGNORECASE flag lets the pattern match both upper-case and lower-case letters.
In the second group, I am obliged to write (?:(?<=[ ,])[A-Z0-9]|\A[A-Z0-9])
because lookbehind assertions of variable length are not allowed. This part means: one character other than '-', '/' or '\', either preceded by a blank or a comma, or at the head of the string.
Conversely, (?= |\Z|,)
means 'followed by a blank, the end of the string, or a comma'.
This regex assumes that the characters '-', '/' and '\' can't be the first or the last character of a captured string. Is that right?
import re
pat = re.compile('(?=(?:([A-Z])|[0-9])' ## (from here) This part verifies that
'[^ ]*' # there is at least one
'(?(1)\d|[A-Z]))' ## (to here) letter and one digit.
'((?:(?<=[ ,])[A-Z0-9]|\A[A-Z0-9])'
'[A-Z0-9-/\\\\]*'
'[A-Z0-9](?= |\Z|,))',
re.IGNORECASE) # this group 2 catches the string
ch = "ALPHA13 10 ZZ 10-10 U-R open-office ,10B a10 UCS5000 -TR54 code vg4- DV-3000 SEA 300-BR gt4/ui bn\\3K"
print([mat.group(2) for mat in pat.finditer(ch)])
s = "A35, 35A, B503X,1ABC5 " +\
"AB-10, 10-AB, A10-BA, BA-A10, etc... " +\
"10-10, open-office, etc."
print([mat.group(2) for mat in pat.finditer(s)])
Result:
['ALPHA13', '10B', 'a10', 'UCS5000', 'DV-3000', '300-BR', 'gt4/ui', 'bn\\3K']
['A35', '35A', 'B503X', '1ABC5', 'AB-10', '10-AB', 'A10-BA', 'BA-A10']
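Since the question targets Java, here is a rough port of the idea. java.util.regex has no conditional groups such as (?(1)...), so this sketch swaps the conditional for two plain lookaheads, and it uses single-character negative lookarounds for the delimiters. The class and method names are made up for illustration, and the sketch assumes that words are separated only by blanks and commas.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MixedWordPort {
    // Hypothetical Java port of the Python pattern above (not the original code).
    private static final Pattern MIXED = Pattern.compile(
          "(?<![^ ,])"                         // at the start, or right after a blank or comma
        + "(?=[^ ,]*[A-Z])"                    // the word contains at least one letter
        + "(?=[^ ,]*[0-9])"                    // the word contains at least one digit
        + "[A-Z0-9][A-Z0-9/\\\\-]*[A-Z0-9]"    // starts and ends with a letter or digit
        + "(?![^ ,])",                         // at the end, or right before a blank or comma
        Pattern.CASE_INSENSITIVE);

    static List<String> findMixedWords(String text) {
        List<String> words = new ArrayList<>();
        Matcher m = MIXED.matcher(text);
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    public static void main(String[] args) {
        String ch = "ALPHA13 10 ZZ 10-10 U-R open-office ,10B a10 UCS5000 -TR54 "
                  + "code vg4- DV-3000 SEA 300-BR gt4/ui bn\\3K";
        System.out.println(findMixedWords(ch));
    }
}
```

Because the lookarounds are zero-width, the whole match is the word itself and no capture group is needed.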
Upvotes: 1
Reputation: 36229
A condition of the form (A OR NOT A) can be omitted, so the symbols can safely be ignored.
for (String word : "10 10-10 open-office 10B A10 UCS5000 code DV-3000 300-BR".split(" ")) {
    if (word.matches("(.*[A-Za-z].*[0-9].*)|(.*[0-9].*[A-Za-z].*)")) {
        // do something
    }
}
You didn't mention -x4, 4x-, 4-x-, -4-x or -4-x-; I expect them all to match.
My expression just looks for something-alpha-something-digits-something, where "something" may be letters, digits or symbols, and for the opposite order: something-digits-something-alpha-something. If something else might occur, like !#$~()[]{} and so on, it would get longer.
Tested with Scala:
scala> for (word <- "10 10-10 open-office 10B A10 UCS5000 code DV-3000 300-BR".split (" ")
| if word.matches ("(.*[A-Za-z].*[0-9].*)|(.*[0-9].*[A-Za-z].*)")) yield word
res89: Array[java.lang.String] = Array(10B, A10, UCS5000, DV-3000, 300-BR)
Slightly modified to filter matches:
String s = "A35, 35A, B53X, 1AC5, AB-10, 10-AB, A10-BA, BA-A10, etc. -4x, 4x- -4-x- 10-10, oe-oe, etc";
Pattern pattern = java.util.regex.Pattern.compile ("\\b([^ ,]*[A-Za-z][^ ,]*[0-9])[^ ,]*|([^ ,]*[0-9][^ ,]*[A-Za-z][^ ,]*)\\b");
Matcher matcher = pattern.matcher (s);
while (matcher.find ()) { System.out.print (matcher.group () + "|"); }
But I still have an error that I can't find:
A35|35A|B53X|1AC5|AB-10|10-AB|A10-BA|BA-A10|-4x|4x|-4-x|
4x should be 4x-, and -4-x should be -4-x-.
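A likely cause of the truncated matches is \b. A word boundary needs a letter, digit or underscore on one side, so the trailing \b can never sit between '-' and a blank, and the match backtracks to the last word character. In addition, | has the lowest precedence, so the leading \b only applies to the first alternative and the trailing \b only to the second. A minimal sketch of one way around this, assuming words are delimited only by blanks and commas (class and method names are made up for illustration):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DelimitedMixedWords {
    // Replaces \b with a lookbehind on the delimiters, so symbols may start or end a word.
    private static final Pattern MIXED = Pattern.compile(
          "(?<![^ ,])"           // at the start, or right after a blank or comma
        + "(?=[^ ,]*[A-Za-z])"   // the word contains at least one letter
        + "(?=[^ ,]*[0-9])"      // the word contains at least one digit
        + "[^ ,]+");             // greedily take everything up to the next delimiter

    static List<String> findMixedWords(String text) {
        List<String> words = new ArrayList<>();
        Matcher m = MIXED.matcher(text);
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    public static void main(String[] args) {
        String s = "A35, 35A, B53X, 1AC5, AB-10, 10-AB, A10-BA, BA-A10, etc. "
                 + "-4x, 4x- -4-x- 10-10, oe-oe, etc";
        for (String word : findMixedWords(s)) {
            System.out.print(word + "|");
        }
        // prints A35|35A|B53X|1AC5|AB-10|10-AB|A10-BA|BA-A10|-4x|4x-|-4-x-|
    }
}
```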
Upvotes: 0
Reputation: 3167
My first pass yields
(^|\s)(?!\d+[-/\\]?\d+(\s|$))(?![A-Z]+[-/\\]?[A-Z]+(\s|$))([A-Z0-9]+[-/\\]?[A-Z0-9]+)(\s|$)
Sorry, but it's not Java formatted (you'll need to double the backslashes in \s, \d, etc.). Also, you can't use \b here, because a word boundary occurs next to any character that is not a letter, digit or underscore, which includes the separators, so I used \s and the start and end of the string.
This is still a bit raw.
EDIT
Version 2, slightly better, but it could be improved for performance by using possessive quantifiers. It matches ABC76, AB-32, 3434-F, etc., but not ABC or 19\23, etc.
((?<=^)|(?<=\s))(?!\d+[-/\\]?\d+(\s|$))(?![A-Z]+[-/\\]?[A-Z]+(\s|$))([A-Z0-9]+[-/\\]?[A-Z0-9]+)((?=$)|(?=\s))
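For reference, here is a sketch of how Version 2 might look once escaped for a Java string literal: every regex backslash is doubled, so \s becomes \\s and the literal backslash in the class becomes \\\\. The class and method names are hypothetical. The pattern is used as written, so it only handles upper-case letters and a single optional separator.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class VersionTwoInJava {
    // Version 2 from above, with backslashes doubled for a Java string literal.
    private static final Pattern MIXED = Pattern.compile(
          "((?<=^)|(?<=\\s))"                    // start of input or preceded by whitespace
        + "(?!\\d+[-/\\\\]?\\d+(\\s|$))"         // reject digits-only words
        + "(?![A-Z]+[-/\\\\]?[A-Z]+(\\s|$))"     // reject letters-only words
        + "([A-Z0-9]+[-/\\\\]?[A-Z0-9]+)"        // the word itself
        + "((?=$)|(?=\\s))");                    // end of input or followed by whitespace

    static List<String> findMixedWords(String text) {
        List<String> words = new ArrayList<>();
        Matcher m = MIXED.matcher(text);
        while (m.find()) {
            words.add(m.group()); // the lookarounds are zero-width, so group() is the word
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(findMixedWords("ABC76 AB-32 3434-F ABC 19\\23"));
        // prints [ABC76, AB-32, 3434-F]
    }
}
```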
Upvotes: 0
Reputation: 34395
Interesting challenge. Here is a java program with a regex that picks out the types of "words" you are after:
import java.util.regex.*;
public class TEST {
    public static void main(String[] args) {
        String s = "A35, 35A, B503X, 1ABC5 " +
                   "AB-10, 10-AB, A10-BA, BA-A10, etc... " +
                   "10-10, open-office, etc.";
        Pattern regex = Pattern.compile(
            "# Match special word having one letter and one digit (min).\n" +
            "\\b # Match first word having\n" +
            "(?=[-/\\\\A-Za-z]*[0-9]) # at least one number and\n" +
            "(?=[-/\\\\0-9]*[A-Za-z]) # at least one letter.\n" +
            "[A-Za-z0-9]+ # Match first part of word.\n" +
            "(?: # Optional extra word parts\n" +
            " [-/\\\\] # separated by -, / or backslash\n" +
            " [A-Za-z0-9]+ # Match extra word part.\n" +
            ")* # Zero or more extra word parts.\n" +
            "\\b # Start and end on a word boundary",
            Pattern.COMMENTS);
        Matcher regexMatcher = regex.matcher(s);
        while (regexMatcher.find()) {
            System.out.print(regexMatcher.group() + ", ");
        }
    }
}
Here is the correct output:
A35, 35A, B503X, 1ABC5, AB-10, 10-AB, A10-BA, BA-A10,
Note that the only complex regexes that are "ugly" are those that are not properly formatted and commented!
Upvotes: 6
Reputation: 152246
Just use this:
([a-zA-Z]+[-\/\\]?[0-9]+|[0-9]+[-\/\\]?[a-zA-Z]+)
In a Java string literal each backslash must be doubled; the / needs no escaping at all:
([a-zA-Z]+[-/\\\\]?[0-9]+|[0-9]+[-/\\\\]?[a-zA-Z]+)
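As a quick check, this pattern can be dropped into a small Java program. This is only a sketch with made-up class and method names; inside a string literal the literal backslash in the character class is written as \\\\ and the / is left unescaped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SimpleMixedWords {
    // letters + optional separator + digits, or digits + optional separator + letters
    private static final Pattern MIXED = Pattern.compile(
        "([a-zA-Z]+[-/\\\\]?[0-9]+|[0-9]+[-/\\\\]?[a-zA-Z]+)");

    static List<String> findMixedWords(String text) {
        List<String> words = new ArrayList<>();
        Matcher m = MIXED.matcher(text);
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(findMixedWords("A35, 35A, AB-10, 10-AB, 10-10, open-office"));
        // prints [A35, 35A, AB-10, 10-AB]
    }
}
```

Note that this pattern covers only one run of letters and one run of digits, so a longer word such as B503X is matched only in part (B503).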
Upvotes: 1