Hossein

Reputation: 25924

How to include or exclude some specific patterns in regex?

I am trying to match parts of a URL. First I am trying to get a match only for something like this:

http://Stackoverflow.com/questions/blah/balh.blah  
http://www.stackoverflow.com/questions/blah/balh.blah  
stackoverflow.com/questions/blah/balh.blah  
www.stackoverflow.com/  

But I also want to support other protocols such as https and ftp. I wrote the following myself, but it does not work well:

((http:\/\/|https:\/\/|ftp:\/\/)*)((www.)*)([a-z]+).([a-z]{2,3})(\/)*

There are lots of problems with this regex, and I need to figure out how to fix it.
First, how can I specify, for example, that only http or https is valid, and not htttp, hazzzzt, etc.?
To be more precise:

  1. How can we specify a specific word to be included or excluded?

What is clear now is that (http) is not treated like a word, it is just a class set of characters, so any word that has only one of those letters gets a match. I have read that \b works as a word boundary, but \bhttp\b does not seem to make the regex treat http as a single word rather than a set of characters!

As for the www part, it matches wwww, ww, or any other number of w's! I always get a match no matter what I input. I use http://regex101.com/ to test the regex.

Upvotes: 1

Views: 202

Answers (3)

zx81

Reputation: 41838

Hossein, your post raises several points and questions.

A. How to include or exclude some specific patterns in regex?

There are many techniques. For simple patterns, you specify what you want, or you specify what you don't want, either with negated character classes or with negative lookarounds. For more intricate patterns, a great place to start is the question "Match (or replace) a pattern except in situations s1, s2, s3 etc".

B. How can a specific word be included or excluded?

In general, to make sure a specific word belongs or doesn't belong to a string, if you don't know its placement, you do a lookahead (or negative lookahead) at the beginning of the string:

^(?=.*?MyWord)   # makes sure the word is there

or

^(?!.*?MyWord)   # makes sure the word is not there
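As a rough Python sketch of the two lookaheads above: the helper names `contains_word` and `lacks_word` and the sample strings are illustrative assumptions, not part of the original answer. Note that these patterns test for a substring; wrap the word in \b...\b if you need a whole-word test.

```python
import re

def contains_word(text: str, word: str) -> bool:
    """Positive lookahead: succeeds only if `word` occurs somewhere in `text`."""
    pattern = r"^(?=.*?" + re.escape(word) + r")"
    return re.search(pattern, text) is not None

def lacks_word(text: str, word: str) -> bool:
    """Negative lookahead: succeeds only if `word` occurs nowhere in `text`."""
    pattern = r"^(?!.*?" + re.escape(word) + r")"
    return re.search(pattern, text) is not None

if __name__ == "__main__":
    print(contains_word("see http://example.com", "http"))  # True
    print(lacks_word("ftp://example.com", "http"))          # True
```

Because `.` does not match a newline by default, this sketch only inspects the first line of a multi-line string unless you pass `re.DOTALL`.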

C. What is clear now, is that (http) is not treated like a word, it is just a class set of characters, so any word that has only one of those letters gets a match

That is not correct. (http) matches only the literal sequence http; it will not match ptth, for instance. Perhaps you are thinking of [http], a character class that matches a single character, either h, t or p (the repeated t is redundant, so [pth] would do the same).
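A small Python sketch may make the difference between a group and a character class concrete. The names `GROUP`, `CHAR_CLASS`, `SCHEME` and `has_valid_scheme` are made up for illustration, and the scheme alternation is one possible way to accept only http, https or ftp.

```python
import re

GROUP = re.compile(r"(http)")       # the literal sequence h, t, t, p
CHAR_CLASS = re.compile(r"[http]")  # exactly ONE character: h, t or p

# An alternation of whole words, followed by "://", anchored by match().
SCHEME = re.compile(r"(?:https?|ftp)://")

def has_valid_scheme(url: str) -> bool:
    """True if `url` starts with http://, https:// or ftp://."""
    return SCHEME.match(url) is not None

if __name__ == "__main__":
    print(GROUP.search("ptth") is not None)         # False: letters in the wrong order
    print(CHAR_CLASS.fullmatch("t") is not None)    # True: one allowed character
    print(CHAR_CLASS.fullmatch("http") is not None) # False: a class matches one character
    print(has_valid_scheme("https://example.com"))  # True
    print(has_valid_scheme("htttp://example.com"))  # False
```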

D. How to Match the Parts of a URL

There are many solutions to this, but for today I'd suggest not reinventing the wheel. May I suggest the regex in the RegexBuddy library for this purpose? It is

(?i)\b((?#protocol)https?|ftp)://((?#domain)[-A-Z0-9.]+)((?#file)/[-A-Z0-9+&@#/%=~_|!:,.;]*)?((?#parameters)\?[A-Z0-9+&@#/%=~_|!:,.;]*)?

Here follows a token-by-token explanation. (I added the case-insensitive (?i) modifier at the beginning.)

  • Assert position at a word boundary (position preceded or followed—but not both—by a Unicode letter, digit, or underscore) \b
  • Match the regex below and capture its match into backreference number 1 ((?#protocol)https?|ftp)
    • Match this alternative (attempting the next alternative only if this one fails) (?#protocol)https?
      • Comment: protocol (?#protocol)
      • Match the character string “http” literally (case insensitive) http
      • Match the character “s” literally (case insensitive) s?
        • Between zero and one times, as many times as possible, giving back as needed (greedy) ?
    • Or match this alternative (the entire group fails if this one fails to match) ftp
      • Match the character string “ftp” literally (case insensitive) ftp
  • Match the character string “://” literally ://
  • Match the regex below and capture its match into backreference number 2 ((?#domain)[-A-Z0-9.]+)
    • Comment: domain (?#domain)
    • Match a single character present in the list below [-A-Z0-9.]+
      • Between one and unlimited times, as many times as possible, giving back as needed (greedy) +
      • The literal character “-” -
      • A character in the range between “A” and “Z” (case insensitive) A-Z
      • A character in the range between “0” and “9” 0-9
      • The literal character “.” .
  • Match the regex below and capture its match into backreference number 3 ((?#file)/[-A-Z0-9+&@#/%=~_|!:,.;]*)?
    • Between zero and one times, as many times as possible, giving back as needed (greedy) ?
    • Comment: file (?#file)
    • Match the character “/” literally /
    • Match a single character present in the list below [-A-Z0-9+&@#/%=~_|!:,.;]*
      • Between zero and unlimited times, as many times as possible, giving back as needed (greedy) *
      • The literal character “-” -
      • A character in the range between “A” and “Z” (case insensitive) A-Z
      • A character in the range between “0” and “9” 0-9
      • A single character from the list “+&@#/%=~_|!:,.;” +&@#/%=~_|!:,.;
  • Match the regex below and capture its match into backreference number 4 ((?#parameters)\?[A-Z0-9+&@#/%=~_|!:,.;]*)?
    • Between zero and one times, as many times as possible, giving back as needed (greedy) ?
    • Comment: parameters (?#parameters)
    • Match the character “?” literally \?
    • Match a single character present in the list below [A-Z0-9+&@#/%=~_|!:,.;]*
      • Between zero and unlimited times, as many times as possible, giving back as needed (greedy) *
      • A character in the range between “A” and “Z” (case insensitive) A-Z
      • A character in the range between “0” and “9” 0-9
      • A single character from the list “+&@#/%=~_|!:,.;” +&@#/%=~_|!:,.;
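To see the four capture groups in action, here is a minimal Python sketch that applies the pattern above unchanged. The function name `split_url` and the dictionary keys are assumptions chosen for illustration; Python's `re` module accepts both the leading (?i) flag and the (?#...) inline comments.

```python
import re

# The pattern from the answer, split across lines only for readability.
URL_RE = re.compile(
    r"(?i)\b((?#protocol)https?|ftp)://"
    r"((?#domain)[-A-Z0-9.]+)"
    r"((?#file)/[-A-Z0-9+&@#/%=~_|!:,.;]*)?"
    r"((?#parameters)\?[A-Z0-9+&@#/%=~_|!:,.;]*)?"
)

def split_url(text: str):
    """Return the captured URL parts of the first match, or None if there is none."""
    m = URL_RE.search(text)
    if m is None:
        return None
    protocol, domain, file_part, parameters = m.groups()
    return {
        "protocol": protocol,
        "domain": domain,
        "file": file_part,          # None when the URL has no path
        "parameters": parameters,   # None when the URL has no query string
    }

if __name__ == "__main__":
    print(split_url("http://www.stackoverflow.com/questions/blah/balh.blah"))
    print(split_url("stackoverflow.com/questions/blah/balh.blah"))  # None
```

Note that this pattern requires a protocol, so the scheme-less examples from the question (such as stackoverflow.com/questions/blah/balh.blah) are not matched by it as written.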

Upvotes: 2

mrjamesmyers

Reputation: 494

I don't think you need the outer parentheses. For example, the pattern below matches http:// or www. (make sure you escape the period):

(http:\/\/|www\.)

Also, if you are using preg_match, there are slight differences from Apache .htaccess: for instance, you wrap the pattern in a delimiter character such as # to mark its start and end:

$regEx = '#(http:\/\/|www\.)#';
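The point about escaping the period, and the "always getting a match" symptom from the question, can be illustrated with a short Python sketch. The pattern names and the `stackoverflow\.com` host below are assumptions for the example only.

```python
import re

# "(www.)*" allows zero repetitions, so it succeeds on ANY input with an empty match.
always = re.search(r"(www.)*", "anything at all")

UNESCAPED = re.compile(r"www.")   # "." matches any character
ESCAPED = re.compile(r"www\.")    # "\." matches only a literal period

# Optional "www." prefix, matched at most once, anchored at the start.
HOST = re.compile(r"^(?:www\.)?stackoverflow\.com")

if __name__ == "__main__":
    print(repr(always.group(0)))                      # '' (an empty match)
    print(UNESCAPED.fullmatch("wwwx") is not None)    # True
    print(ESCAPED.fullmatch("wwwx") is not None)      # False
    print(HOST.match("www.stackoverflow.com") is not None)   # True
    print(HOST.match("wwww.stackoverflow.com") is not None)  # False
```

Using `?` instead of `*` limits the prefix to zero or one occurrence, and the `^` anchor stops the regex from skipping over extra characters at the start.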

Upvotes: 1

Master Bee

Reputation: 1099

Maybe you can use the PHP filter function?

if (filter_var($url, FILTER_VALIDATE_URL) !== false)

FILTER_VALIDATE_URL validates URLs according to RFC 2396.

http://www.php.net/manual/de/filter.filters.validate.php

Upvotes: 0
