Reputation: 723
I have PHP code that asks for a search term, splits it, and generates a regex to match (and highlight) the pattern. For example:
If I enter ou, it generates the following pattern: (o)(.*)(u). It then replaces it with <em>$1</em>$2<em>$3</em>.
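The generating code itself is not shown here. A minimal sketch of what such a generator could look like is given below; the function name buildHighlight is hypothetical, and the whole block is an assumption about the approach rather than the actual code:

```php
<?php
// Hypothetical helper: build a highlighting pattern and replacement from a search term.
function buildHighlight(string $term): array
{
    // Split the term into single characters (UTF-8 aware).
    $chars = preg_split('//u', $term, -1, PREG_SPLIT_NO_EMPTY);

    $groups = [];
    $replacement = '';
    foreach ($chars as $i => $char) {
        // Escape the character so regex metacharacters are matched literally.
        $groups[] = '(' . preg_quote($char, '/') . ')';
        // Odd-numbered groups hold the term's characters, even-numbered groups the gaps.
        $replacement .= '<em>$' . (2 * $i + 1) . '</em>';
        if ($i < count($chars) - 1) {
            $replacement .= '$' . (2 * $i + 2);
        }
    }

    // The "u" modifier is an assumption so that multibyte terms keep working.
    return ['/' . implode('(.*)', $groups) . '/u', $replacement];
}

[$pattern, $replacement] = buildHighlight('ou');
echo $pattern, "\n";      // /(o)(.*)(u)/u
echo $replacement, "\n";  // <em>$1</em>$2<em>$3</em>
```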
In the following data
boau #fie diu1^^j dauijz16 abc123 wwx,usq
this would have the following effect:
b<em>o</em>au #fie diu1^^j dauijz16 abc123 wwx,<em>u</em>sq
The problem is that I would like to be able to limit, for example, the number of spaces allowed in the match. If I limit spaces to 3, that would give the following result:
b<em>o</em>au #fie diu1^^j da<em>u</em>ijz16 abc123 wwx,usq
Or, with a limit of 3 spaces and at most one ^:
b<em>o</em>au #fie di<em>u</em>1^^j dauijz16 abc123 wwx,usq
Or, if no digits are allowed:
b<em>o</em>au #fie di<em>u</em>1^^j dauijz16 abc123 wwx,usq
So I would like to be able to enter the pattern to search for and specify a separate limit for certain characters, but I have no idea how to do this. I think it will have something to do with lookaheads, but I can't figure out how to use those.
Upvotes: 1
Views: 199
Reputation: 91375
To limit the number of spaces, I'd use:
(o)((?:\S*\s){0,3}\S*)(u)
Here is a Perl script that uses it:
use feature 'say';  # enables say()
my $re = qr/(o)((?:\S*\s){0,3}\S*)(u)/;
my $str = 'boau #fie d iu1^^j dauij z16 abc123 wwx,usq';
$str =~ s!$re!<em>$1</em>$2<em>$3</em>!;
say $str;
output:
b<em>o</em>au #fie d i<em>u</em>1^^j dauij z16 abc123 wwx,usq
Explanation:
The regular expression:
(?-imsx:(o)((?:\S*\s){0,3}\S*)(u))
matches as follows:
NODE                     EXPLANATION
----------------------------------------------------------------------
(?-imsx:                 group, but do not capture (case-sensitive)
                         (with ^ and $ matching normally) (with . not
                         matching \n) (matching whitespace and #
                         normally):
----------------------------------------------------------------------
  (                        group and capture to \1:
----------------------------------------------------------------------
    o                        'o'
----------------------------------------------------------------------
  )                        end of \1
----------------------------------------------------------------------
  (                        group and capture to \2:
----------------------------------------------------------------------
    (?:                      group, but do not capture (between 0 and
                             3 times (matching the most amount
                             possible)):
----------------------------------------------------------------------
      \S*                      non-whitespace (all but \n, \r, \t,
                               \f, and " ") (0 or more times
                               (matching the most amount possible))
----------------------------------------------------------------------
      \s                       whitespace (\n, \r, \t, \f, and " ")
----------------------------------------------------------------------
    ){0,3}                   end of grouping
----------------------------------------------------------------------
    \S*                      non-whitespace (all but \n, \r, \t, \f,
                             and " ") (0 or more times (matching the
                             most amount possible))
----------------------------------------------------------------------
  )                        end of \2
----------------------------------------------------------------------
  (                        group and capture to \3:
----------------------------------------------------------------------
    u                        'u'
----------------------------------------------------------------------
  )                        end of \3
----------------------------------------------------------------------
)                        end of grouping
----------------------------------------------------------------------
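Since the question is about PHP, the same pattern could be used with preg_replace. The sketch below makes the whitespace limit a parameter; the function name highlightWithWhitespaceLimit and its signature are hypothetical, chosen only for illustration:

```php
<?php
// Hypothetical helper: highlight $first ... $last with at most $maxWhitespace
// whitespace characters in between.
function highlightWithWhitespaceLimit(string $text, string $first, string $last, int $maxWhitespace): string
{
    // Same shape as the Perl pattern above, with the limit filled in.
    $pattern = sprintf(
        '/(%s)((?:\S*\s){0,%d}\S*)(%s)/',
        preg_quote($first, '/'),
        $maxWhitespace,
        preg_quote($last, '/')
    );

    // Limit of 1 mirrors Perl's s!!! without the /g flag (first match only).
    return preg_replace($pattern, '<em>$1</em>$2<em>$3</em>', $text, 1);
}

$str = 'boau #fie d iu1^^j dauij z16 abc123 wwx,usq';
echo highlightWithWhitespaceLimit($str, 'o', 'u', 3), "\n";
// b<em>o</em>au #fie d i<em>u</em>1^^j dauij z16 abc123 wwx,usq
```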
Upvotes: 1
Reputation: 71538
You can make use of negated character classes. To limit the match to 3 spaces:
(o)((?:[^ ]* ){0,3}[^ ]*)(u)
To disallow digits:
(o)(\D*)(u)
\D matches any character except a digit. Note that it is equivalent to the negated class [^\d].
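As a quick check, here is a small sketch that runs both patterns against the sample data from the question with preg_replace. The variable names are illustrative only:

```php
<?php
$data = 'boau #fie diu1^^j dauijz16 abc123 wwx,usq';
$replacement = '<em>$1</em>$2<em>$3</em>';

// At most 3 spaces between "o" and "u".
$limited = preg_replace('/(o)((?:[^ ]* ){0,3}[^ ]*)(u)/', $replacement, $data);
echo $limited, "\n";
// b<em>o</em>au #fie diu1^^j da<em>u</em>ijz16 abc123 wwx,usq

// No digits between "o" and "u".
$noDigits = preg_replace('/(o)(\D*)(u)/', $replacement, $data);
echo $noDigits, "\n";
// b<em>o</em>au #fie di<em>u</em>1^^j dauijz16 abc123 wwx,usq
```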
The second requirement is much more complex than the above two:
(o)([^ ^]*(?:(\^)|( ))?[^ ^]*(?(3) |(?:( )|(\^)))?[^ ^]*(?(6) |(?:( )|(\^)))?[^ ^]*(?(8) |\^)?[^ ^]*)(u)
It tries to match either a ^ or a space and, depending on what it captured, decides whether it can match another space, another caret, or neither.
This regex makes use of conditional groups, which are not supported by all regex engines.
As you can see, a single limitation is quite easy, but combining several gets out of hand quickly. I would suggest a state machine if you have multiple conditions, for example, in pseudocode:
match first character "o"
substring = "o"
statecaret = 0
statespace = 0
for each following character
    if character == "^"
        statecaret = statecaret + 1
    else if character == " "
        statespace = statespace + 1
    if statecaret == 2 or statespace == 4
        break and reject character
    else
        add character to substring
find last "u" in substring
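The pseudocode above could be turned into PHP roughly as follows. This is only a sketch under a few assumptions: the function name highlightWithLimits and the $limits array (character => maximum count) are hypothetical, $first and $last are single-byte characters, and only the first occurrence of $first is tried, as in the pseudocode:

```php
<?php
// Hypothetical helper: highlight $first ... $last, where the text in between
// may contain at most $limits[char] occurrences of each listed character.
function highlightWithLimits(string $text, string $first, string $last, array $limits): string
{
    $start = strpos($text, $first);
    if ($start === false) {
        return $text; // nothing to highlight
    }

    $counts = array_fill_keys(array_keys($limits), 0);
    $end = false; // offset of the last acceptable $last seen so far

    for ($i = $start + 1, $n = strlen($text); $i < $n; $i++) {
        $char = $text[$i];
        if (isset($limits[$char]) && ++$counts[$char] > $limits[$char]) {
            break; // this character would exceed its limit: stop scanning
        }
        if ($char === $last) {
            $end = $i;
        }
    }

    if ($end === false) {
        return $text; // no $last reachable within the limits
    }

    return substr($text, 0, $start)
        . '<em>' . $first . '</em>'
        . substr($text, $start + 1, $end - $start - 1)
        . '<em>' . $last . '</em>'
        . substr($text, $end + 1);
}

$data = 'boau #fie diu1^^j dauijz16 abc123 wwx,usq';
echo highlightWithLimits($data, 'o', 'u', [' ' => 3, '^' => 1]), "\n";
// b<em>o</em>au #fie di<em>u</em>1^^j dauijz16 abc123 wwx,usq
```

Adding another limit is then a matter of adding one more entry to the array rather than rewriting the regex.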
Upvotes: 0
Reputation: 784938
You have asked quite a few questions here.
I am going to answer the one that appears most complex, i.e. limiting spaces to 3.
You can use this regex:
$s = 'boau #fie diu1^^j dauijz16 abc123 wwx,usq';
$r = preg_replace('/(o)((?:[^ ]* ){0,3}[^ u]*)(u)/', '<em>$1</em>$2<em>$3</em>', $s);
//=> b<em>o</em>au #fie diu1^^j da<em>u</em>ijz16 abc123 wwx,usq
Explanation:
1st Capturing group (o)
    o matches the character o literally (case sensitive)
2nd Capturing group ((?:[^ ]* ){0,3}[^ u]*)
    (?:[^ ]* ){0,3} Non-capturing group
        Quantifier: Between 0 and 3 times
        [^ ]* match a single character not present in the list below
            Quantifier: Between zero and unlimited times, as many times as possible, giving back as needed [greedy]
            ' ' the literal space character
        ' ' matches the space character literally
    [^ u]* match a single character not present in the list below
        Quantifier: Between zero and unlimited times, as many times as possible, giving back as needed [greedy]
        ' u' a single character in the list: space or u, literally (case sensitive)
3rd Capturing group (u)
    u matches the character u literally (case sensitive)
This output matches your expected result. I hope you can use the same approach to build regexes for the other parts of your question.
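One detail that may matter when adapting this pattern: the trailing [^ u]* cannot cross a "u", so within the last word it stops at the first "u", whereas a trailing [^ ]* is greedy and ends on the last one. On the sample data both give the same result. The sketch below uses a made-up input string, 'xo aubu', purely to show the difference:

```php
<?php
$replacement = '<em>$1</em>$2<em>$3</em>';
$text = 'xo aubu'; // made-up input, not from the question

// [^ u]* cannot cross a "u": the first "u" of the last word is highlighted.
$firstU = preg_replace('/(o)((?:[^ ]* ){0,3}[^ u]*)(u)/', $replacement, $text);
echo $firstU, "\n"; // x<em>o</em> a<em>u</em>bu

// [^ ]* may cross a "u": the last "u" of the last word is highlighted.
$lastU = preg_replace('/(o)((?:[^ ]* ){0,3}[^ ]*)(u)/', $replacement, $text);
echo $lastU, "\n"; // x<em>o</em> aub<em>u</em>
```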
Upvotes: 0