Reputation: 6131
I have a set of n tokens (e.g., a, b, c) distributed among a bunch of other tokens. I would like to know if all members of my set occur within a given number of positions (window size). It occurred to me that it may be possible to write a RegEx to capture this state, but the exact syntax eludes me.
11111 012345678901234 ab ab bc a cba
In this example, given window size=5, I would like to match cba
at positions 12-14, and abc
in positions 3-7.
Is there a way to do this with RegEx, or is there some other kind of grammar that I can use to capture this logic?
I am hoping to implement this in Java.
Upvotes: 2
Views: 2162
Reputation: 75252
Pattern p = Pattern.compile("(?:a()|b()|c()|.){5}\\1\\2\\3");
String s = "ab ab bc a cba";
Matcher m = p.matcher(s);
while (m.find())
{
System.out.println(m.group());
}
output:
ab bc
a cb
This is inspired by Recipe #5.7 in Regular Expressions Cookbook. Each back-reference (\1
, \2
, \3
) acts like a zero-width assertion, indicating that the corresponding capturing group participated in the match, even though the group itself didn't consume any characters.
The authors warn that this trick relies on behavior that's undocumented in most flavors. It works in Java, .NET, Perl, PHP, Python and Ruby (original and Oniguruma), but not in JavaScript or ActionScript.
Upvotes: 0
Reputation: 34425
This tested Java program has a commented regex which does the trick:
import java.util.regex.*;
public class TEST {
public static void main(String[] args) {
String s = "ab ab bc a cba";
Pattern p = Pattern.compile(
"# Match 5 char sequences containing: a and b and c\n" +
"(?=[abc]) # Assert first char is a, b or c.\n" +
"(?=.{0,4}a) # Assert an 'a' within 5 chars.\n" +
"(?=.{0,4}b) # Assert an 'b' within 5 chars.\n" +
"(?=.{0,4}c) # Assert an 'c' within 5 chars.\n" +
".{5} # If so, match the 5 chers.",
Pattern.COMMENTS);
Matcher m = p.matcher(s);
while (m.find()) {
System.out.print("Match = \""+ m.group() +"\"\n");
}
}
}
Note that there is another valid sequence S9:13" a cb"
in your test data (before the S12:14"cba"
. Assuming you did not want to match this one, I added an additional constraint to filter it out, which requires that the 5 char window must begin with an a
, b
or c
.
Here is the output from the script:
Match = "ab bc"
Match = "a cba"
Upvotes: 2
Reputation: 2897
Here's a regex that matches 5-letter sequences that include all of 'a', 'b' and 'c':
(?=.{0,4}a)(?=.{0,4}b)(?=.{0,4}c).{5}
So, while basically matching any 5 characters (with .{5}
), there are three preconditions the matches have to observe. Each of them requires one of the tokens/letters to be present (up to 4 characters followed by 'a', etc.). (?=X)
matches "X, with a zero-width positive look-ahead", where zero-width means that the character position is not moved while matching.
Doing this with regexes is slow, though.. Here's a more direct version (seems about 15x faster than using regular expressions):
public static void find(String haystack, String tokens, int windowLen) {
char[] tokenChars = tokens.toCharArray();
int hayLen = haystack.length();
int pos = 0;
nextPos:
while (pos + windowLen <= hayLen) {
for (char c : tokenChars) {
int i = haystack.indexOf(c, pos);
if (i < 0) return;
if (i - pos >= windowLen) {
pos = i - windowLen + 1;
continue nextPos;
}
}
// match found at pos
System.out.println(pos + ".." + (pos + windowLen - 1) + ": " + haystack.substring(pos, pos + windowLen));
pos++;
}
}
Upvotes: 2
Reputation: 272677
Well, one possibility (albeit a completely impractical one) is simply to match against all permutations:
abc..|ab.c.|ab..c| .... etc.
This can be factorised somewhat:
ab(c..|.c.|..c)|a.(bc.|b.c .... etc.
I'm not sure if you can do better with regex.
Upvotes: 1