Andy

Reputation: 138

preg_match_all for special characters [?]

I have a URL:

https://my.site.com/u/0/ac?export=download&confirm=45vy&id=qNhdhk1jejhXLexLpY3RiDY2oamis">D

And I want to match it using preg_match_all. My regular expression is:

preg_match_all('/(https:\/\/my\.site\.com\/[u]\/[0]\/(ac)\/(?)\/.*\">D)/', $input_lines, $output_array);

But I am not able to match the special character ? in the above code. I tried using (?), but it does not match. I know it may be a lame question, but if anyone could help me match ? or escape ? in preg_match_all, that would be helpful.

Upvotes: 2

Views: 381

Answers (2)

Steven

Reputation: 6148

Your regex

/(https:\/\/my\.site\.com\/[u]\/[0]\/(ac)\/(?)\/.*\">D)/
^                           ^    ^    ^   ^ ^    ^     ^
1                           2    2    3   4 5    6     1
+-- Starting delimiter      |    |    |   | |    |     +-- Ending delimiter
                            |    |    |   | |    +-- This is a greedy match and may not stop where intended
                            |    |    |   | +-- `?` is a special character in Regex and does nothing in this scenario; the .* is actually matching the `?`
                            |    |    |   +-- This slash doesn't exist
                            |    |    +-- No need for a capture group
                            +----+-- No need for a character set
  1. Regular expression pattern delimiters:

    • ...mark the start and end of a pattern; similar to single/double quotes marking the start and end of strings

    • As with quotes, if you use the delimiter inside the pattern you have to escape it

    • To avoid escaping you can use a different delimiter

      Pattern 1: /https:\/\/www\.website\.com\/page\/1\/index\.php/
      
      Pattern 2: ~https://www\.website\.com/page/1/index\.php~
      

  2. As you only want to match characters literally, you can simply use the characters themselves in the pattern. You only need a character set when a position can hold one of several characters

   Set       Matched value
   u    ===> u
   [u]  ===> u
   [ua] ===> u OR a
  3. As with 2, you don't need a capture group here because you're only interested in capturing the whole string. The group would add $output_array[1] = "ac" to your output

  4. The pattern expects a / after ac that doesn't exist in the URL, so it will never match anything

  5. The ? is a special character in regex; typically it is used at the start of a group (a), to modify a quantifier (b), or to make a construct optional (c). In this case (?) does absolutely nothing; the .* would match the literal ? if the extra slash weren't in the pattern.

    a. Used in a group the ? can mean, for example:

       (?:...) ===> Non-capturing group
       (?=...) ===> Positive lookahead
       (?!...) ===> Negative lookahead
    

    b. To modify a quantifier: by default the quantifiers + and * are greedy and match as much as possible. Placing a ? after one makes it non-greedy (lazy), so it stops at the first possible match

    String: IIIIOIIIOIIIO
    
    Pattern       Match
    
    /I.*O/        IIIIOIIIOIIIO
    /I.*?O/       IIIIO
    

    c. To make a construct optional

    Pattern                  Match 1             Match 2             Explanation
    
    ~https?://~              http://             https://            Optional character
    ~(?:www\.)?website.com~  website.com         www.website.com     Optional non-capturing group
    
  6. As per 5b, this is a greedy quantifier so, for example, if the pattern \">D were to appear more than once in a string, this would match up to the last occurrence.

    • i.e. if there were more than one URL in your string then it would match from the first until the last as opposed to matching them individually

      String: <a href="website.com?id=2432546t4534">Link 1</a><a href="website.com?id=24345yr6787">Link 2</a>
      
      Pattern                    Matches
      
      ~website.com\?id=.*">~     [1] website.com?id=2432546t4534">Link 1</a><a href="website.com?id=24345yr6787">
      
      ~website.com\?id=.*?">~    [1] website.com?id=2432546t4534">
                                 [2] website.com?id=24345yr6787">
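
The greedy versus non-greedy behaviour in point 6 can be checked with a short sketch. It reuses the two-link string from the table above; the variable names $html, $greedy and $lazy are made up for illustration, and the dot in website.com is escaped here, which the table patterns leave out.

```php
<?php

// Sample string from the table above: two links in one string.
$html = '<a href="website.com?id=2432546t4534">Link 1</a><a href="website.com?id=24345yr6787">Link 2</a>';

// Greedy: .* runs to the last "> in the string, so one long match is found.
preg_match_all('~website\.com\?id=.*">~', $html, $greedy);

// Non-greedy: .*? stops at the first "> it reaches, so each link matches separately.
preg_match_all('~website\.com\?id=.*?">~', $html, $lazy);

echo count($greedy[0]), PHP_EOL; // number of greedy matches
echo count($lazy[0]), PHP_EOL;   // number of non-greedy matches
print_r($lazy[0]);
```

If the input can contain more than one URL, the non-greedy form is usually the one you want with preg_match_all.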
      

Fix

Updated Regex

~https://my\.site\.com/u/0/ac\?.*?">D~
~                                      : Starting delimiter
 https://my\.site\.com/u/0/ac          : Matches the initial part of the URL
                             \?        : Matches a literal ?
                               .*?     : Non-greedy match any character 0 or more times
                                  ">D  : Match string literally
                                     ~ : Ending delimiter

Code

$input_lines  = 'https://my.site.com/u/0/ac?export=download&confirm=45vy&id=qNhdhk1jejhXLexLpY3RiDY2oamis">D';

preg_match_all('~https://my\.site\.com/u/0/ac\?.*?">D~', $input_lines, $output_array);

print_r($output_array);

Output

Array
(
    [0] => Array
        (
            [0] => https://my.site.com/u/0/ac?export=download&confirm=45vy&id=qNhdhk1jejhXLexLpY3RiDY2oamis">D
        )

)
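
For comparison, here is a rough sketch of what the extra capture group from point 3 does to the result. The pattern is the updated regex with (ac) put back in as a group; the variable name $with_group is only for illustration.

```php
<?php

$input_lines = 'https://my.site.com/u/0/ac?export=download&confirm=45vy&id=qNhdhk1jejhXLexLpY3RiDY2oamis">D';

// Same pattern as the fix, but with ac wrapped in a capture group again.
preg_match_all('~https://my\.site\.com/u/0/(ac)\?.*?">D~', $input_lines, $with_group);

// Index 0 holds the full matches; index 1 holds whatever group 1 captured.
print_r($with_group[0]);
print_r($with_group[1]);
```

With the default flags, preg_match_all adds one extra sub-array per capture group, so dropping groups you don't need keeps the output down to the full matches.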

Upvotes: 2

Akhilesh

Reputation: 968

I noticed that there is no / after ac in the link, but your regex expects one. Either remove it, or use the code below, which I have tested and which works.

<?php

$input_lines = 'https://my.site.com/u/0/ac?export=download&amp;confirm=45vy&amp;id=qNhdhk1jejhXLexLpY3RiDY2oamis">D';
preg_match_all('/(https:\/\/my\.site\.com\/[u]\/[0]\/(ac)(\?).*\">D)/', $input_lines, $output_array);

var_dump($output_array);

Output: https://prnt.sc/weq86u

Or, if the link may also contain ac/? (a / between ac and ?), you can make the / optional in the regex:

<?php

$input_lines = 'https://my.site.com/u/0/ac?export=download&amp;confirm=45vy&amp;id=qNhdhk1jejhXLexLpY3RiDY2oamis">D';
preg_match_all('/(https:\/\/my\.site\.com\/[u]\/[0]\/(ac)\/?(\?).*\">D)/', $input_lines, $output_array);

var_dump($output_array);

This matches links both with and without the /: https://prnt.sc/weqbae
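
A quick way to check the optional slash yourself is to run one pattern against both forms of the link. This is only a sketch: the two shortened URLs and the names $urls, $pattern and $counts are made up for the test, and it uses ~ as the delimiter so the slashes don't need escaping.

```php
<?php

// Hypothetical inputs: one link without a / before the ?, one with it.
$urls = [
    'https://my.site.com/u/0/ac?export=download&confirm=45vy">D',
    'https://my.site.com/u/0/ac/?export=download&confirm=45vy">D',
];

// /? makes the slash optional; \? then matches the literal question mark.
$pattern = '~https://my\.site\.com/u/0/ac/?\?.*?">D~';

$counts = [];
foreach ($urls as $url) {
    // preg_match_all returns the number of full matches found.
    $counts[] = preg_match_all($pattern, $url, $matches);
    echo $url, ' => ', end($counts), PHP_EOL;
}
```

Both inputs should report one match each; if the second one reports zero, the /? is missing from the pattern.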

Upvotes: 3
