Reputation: 2658

How can I exclude a character in a regex character class when it is the last character?

Sorry if the question is phrased poorly (or if it has already been asked; I really tried to find it).

Is it possible to exclude one specific character in a character class from the match when it happens to be the last character (while still allowing it elsewhere)? What I am working on is similar to finding URLs in larger strings. I need to include periods in the pattern, but IF the last character is a period, I want to exclude it because it marks the end of a sentence.

So in a pattern like (other url stuff) "(/[a-zA-Z0-9._-]*)?", is there a way to exclude ONLY the last period, if present? Note that the above specifically covers the URI segments after the domain. I want to match only

"/some_uri/segments.php"

in both

"www.domain.com/some_uri/segments.php"

AND

"www.domain.com/some_uri/segments.php."

while allowing for more than one period to exist in the uri.

If the above isn't clear, imagine I am asking for a way to exclude the final letter in a word, if and only if it is a 'z'. So 'dozzer' and 'dozzerz' both match as 'dozzer' inside a sentence structure (so... no matching to the position at the END of a string). I've played around with lookaheads and the like, but haven't found a way yet. I'm wondering if it's not possible (in just a single regex).

Thanks for your time!

EDIT

I apologize for not making it clearer, but I need to perform the match inside a BLOCK of text. What I'm doing is going through text, finding all the web addresses, and applying markup to them. Thus I CAN'T use positional anchors such as $ to match the end of the string, which has been the biggest problem.

Unless someone else posts an answer that works after this, I think I'm going to have to agree with M477h3w1012 and conclude that it can't be accomplished inside the regex alone. I'm going to need to perform a conditional check after finding matches to determine if they have a trailing period. But thank you all, again, very much for your time and help. :-)

Upvotes: 3

Answers (3)

Casimir et Hippolyte

Reputation: 89557

EDIT: as Adi Inbar points out, your goal isn't to make the pattern fail but to exclude a particular character at the end of a string or at the end of a word:

to exclude a 'z' at the end of a word (a run of several 'z' at the end is excluded too):

with a character class and possessive quantifiers:

(?>[^\Wz]++|z++\B)+ # the most performant way
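To see the idea in action, here is a minimal sketch in Python (an assumption; the thread does not name a language for this pattern). Possessive quantifiers and atomic groups need Python 3.11 or later in `re`, so the sketch swaps in a lookahead instead: a run of 'z' is consumed only when a non-'z' word character follows it. The names `TRIM_Z` and `words_without_trailing_z` are hypothetical.

```python
import re

# Either a word character other than 'z', or a run of 'z' that is
# followed by a non-'z' word character. Trailing 'z's never qualify.
TRIM_Z = re.compile(r"(?:[^\Wz]|z+(?=[^\Wz]))+")


def words_without_trailing_z(text):
    """Return every word found in text, leaving out any trailing 'z' run."""
    return TRIM_Z.findall(text)


print(words_without_trailing_z("a dozzer and a dozzerz here"))
```

Because the lookahead itself rejects 'z', the regex engine cannot backtrack into a shorter run of 'z' and accept it, which is the job the possessive `z++` does in the pattern above.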

to exclude a '.' at the end of a string (a run of several '.' at the end is excluded too):

with a lookahead:

^.+?(?=\.*$)

or with a character class and possessive quantifiers:

(?>[^.]++|\.++(?!$))+

Note that you can easily adapt this expression to the more specific character class you need, for example with [\w.-] for a URI:

$pattern = '~(?>/[\w.-]++)*/(?>[\w-]++|\.++(?!$))+/?~';
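The `(?!$)` lookahead only recognises the end of the whole string, so a period that closes a sentence in the middle of a block of text would still be matched. One possible adaptation, sketched in Python as an assumption rather than as this answer's own pattern, is to let a run of dots match only when a word character or hyphen follows it. `URI_PATH` and `find_uri_paths` are hypothetical names, and the sketch leaves out the optional trailing slash.

```python
import re

# One or more "/segment" parts. Inside a segment, a run of dots is
# accepted only if a word character or hyphen comes right after it,
# so a sentence-ending period is left outside the match.
URI_PATH = re.compile(r"(?:/(?:[\w-]|\.+(?=[\w-]))+)+")


def find_uri_paths(text):
    """Return all URI path matches found anywhere in text."""
    return URI_PATH.findall(text)


sample = "See www.domain.com/some_uri/segments.php. Also www.domain.com/a/b.tar.gz, ok."
print(find_uri_paths(sample))
```

No `$` anchor is involved, so the same pattern can be applied to a whole block of text.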

Upvotes: 0

Adi Inbar

Reputation: 12323

Yes. In general terms, do this:

(<stuff you want to match>)(<character to exclude if at the end>)?$

If <stuff you want to match> ends in a quantifier, that quantifier needs to be non-greedy so that the excluded last character will be matched if it exists.

Then use the first match group (the $1 variable).

However, I see a couple of other problems with your regex.

You need to include / in your character class if you want to be able to match more than one. Otherwise you're just matching from the first / until right before the next one.
I'm not sure why you have a ? at the end. That makes the entire thing optional.

This regex will accomplish what you described:

(/[a-zA-Z0-9._/-]*?)(\.)?$

The match variable $1 will contain everything starting with the first / to the end, but excluding a final dot if there is one (the dot will be in $2).
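As a rough illustration, the same regex can be tried in Python (an assumption; the answer does not name a language), where `group(1)` plays the role of `$1`. `PATH_AT_END` and `path_without_final_dot` are hypothetical names.

```python
import re

# Group 1 is the lazily matched path; group 2 is the optional final dot.
# The $ anchor ties the match to the end of the input string.
PATH_AT_END = re.compile(r"(/[a-zA-Z0-9._/-]*?)(\.)?$")


def path_without_final_dot(s):
    """Return the path from the first '/' to the end, minus one final dot."""
    match = PATH_AT_END.search(s)
    return match.group(1) if match else None


print(path_without_final_dot("www.domain.com/some_uri/segments.php."))
print(path_without_final_dot("www.domain.com/some_uri/segments.php"))
```

Since the pattern relies on `$`, it only applies when the URL sits at the very end of the input, not in the middle of a block of text as described in the question's EDIT.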

Upvotes: 1

J4Numbers

Reputation: 428

I don't think it's possible in a single regex check. Someone might be able to correct me on that, but I can't think of a way to do it at the moment.

What you can do, on the other hand, is run a check. Run the input through an initial replace function first to see whether there is a dot at the end or not and replace it if there is one. From there you can just feed it through the previous regex.

So this is how it could go...

// Strip one or more trailing dots from the URL.
function dotCheck( $url ) {
  return preg_replace( '/\.+$/', '', $url );
}

// urlCheck() is the main validity check, defined elsewhere.
urlCheck( dotCheck( $_POST['form'] ) );

Here urlCheck is the main check that decides whether the input is a valid link structure. In plain terms, the regex matches one or more dots at the very end of the link and deletes them. This should work if someone typed in http://www.google.com. or http://www.google.com.....
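The same strip-afterwards idea could be extended to a block of text by matching loosely first and trimming each match before applying markup. The following is only a sketch in Python (the snippet above is PHP; Python is used here purely for illustration). The `URL` pattern is a deliberately loose, hypothetical stand-in for the real URL regex, and `link_urls` is a hypothetical name.

```python
import re

# Hypothetical, deliberately loose URL pattern: optional scheme, a dotted
# host name, and an optional path that may contain (and end in) dots.
URL = re.compile(r"(?:https?://)?(?:[\w-]+\.)+[a-zA-Z]{2,}(?:/[\w./-]*)?")


def link_urls(text):
    """Wrap each URL in an anchor tag, keeping trailing dots outside it."""
    def wrap(match):
        url = match.group(0)
        trimmed = url.rstrip(".")       # same effect as replacing /\.+$/
        trailing = url[len(trimmed):]   # the dots that were trimmed off
        return f'<a href="{trimmed}">{trimmed}</a>{trailing}'

    return URL.sub(wrap, text)


print(link_urls("Visit www.domain.com/some_uri/segments.php. Thanks."))
```

The trimmed dots are written back after the closing tag, so the sentence punctuation in the surrounding text is preserved.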

Happy scripting.

Upvotes: 1
