user656925

How can I update a regex so that order does not matter in an alternation ( | ), or emulate that behavior?

This regex

(<link\s+)((rel="[Ii]con"\s+)|(rel="[Ss]hortcut [Ii]con"\s+))(href="(.+)")(.+)/>

works for

<link rel="icon" href="http://passets-cdn.pinterest.com/images/favicon.png" type="image/x-icon" />
<link rel="shortcut icon" href="http://css.nyt.com/images/icons/nyt.ico" />
<link rel="shortcut icon" href="http://cdn.sstatic.net/careers/Img/favicon.ico?36da6b" />
<link rel="Shortcut Icon" href="/favicon.ico" type="image/x-icon" />

but not when the href and rel attributes are switched:

  <link href="/phoenix/favicon.ico" rel="shortcut icon" type="image/x-icon" />

How can I update it so that the order of the alternatives does not matter,

so that

aa || bb

works just as well as

bb || aa

Test here:

http://regexpal.com/

I just want to pull the path from the favicon tag. I've chosen not to use a library.

stema's answer, laid out group by group:

<link\s+
(?=
    [^>]*rel="
    (?:[Ss]hortcut\s)?
    [Ii]con"\s+
)
(?:
    [^>]*href="
    (.+?)
    "
)
.*
/>

Upvotes: 1

Views: 90

Answers (4)

stema

Reputation: 93026

You could do it with a positive lookahead:

<link\s+(?=[^>]*rel="(?:[Ss]hortcut\s)?[Ii]con"\s+)(?:[^>]*href="(.+?)").*/>

See it here on Regexr

You will find the path in the first capturing group.

The key point is that the lookahead does not consume any characters. It only checks whether rel="(?:[Ss]hortcut\s)?[Ii]con" appears somewhere within the tag. If that pattern is found, the rest of the regex matches the href part and puts the link into capturing group 1.

(?=[^>]*rel="(?:[Ss]hortcut\s)?[Ii]con"\s+) is the positive lookahead assertion. That is indicated by the ?= at the start of the group.

[^>] is a negated character class that matches any character except >. I use it to ensure that the match does not run past the closing > of the tag.
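
To make the behavior concrete, here is a minimal Python sketch that applies the pattern above to tags in both attribute orders. The names FAVICON_RE and favicon_path are made up for this illustration, and Python is just one engine that supports lookaheads.

```python
import re

# The pattern from this answer: the lookahead checks for rel="icon" or
# rel="shortcut icon" before the closing '>', without consuming anything;
# the href value then lands in capturing group 1.
FAVICON_RE = re.compile(
    r'<link\s+(?=[^>]*rel="(?:[Ss]hortcut\s)?[Ii]con"\s+)'
    r'(?:[^>]*href="(.+?)").*/>'
)


def favicon_path(tag):
    """Return the href of a favicon <link> tag, or None if it does not match."""
    match = FAVICON_RE.search(tag)
    return match.group(1) if match else None


# href before rel, and rel before href
print(favicon_path('<link href="/phoenix/favicon.ico" rel="shortcut icon" type="image/x-icon" />'))
print(favicon_path('<link rel="icon" href="/favicon.ico" type="image/x-icon" />'))
```

Note that the lookahead ends in \s+, so it expects whitespace after the rel attribute; a tag written as rel="icon"/> with no space would not match.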

Upvotes: 3

Scott Saunders

Reputation: 30414

You can use one regex to locate the icon tag and a second regex to pull the path.

If the only text that your second regex parses is a single tag, it can be as simple as /href="([^"]+)"/ and the order of attributes within the tag will not matter.
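
One detail matters in that second regex: a greedy .+ runs on to the last quote in the tag, so when other attributes follow href it captures too much. A short Python sketch of the difference, using the tag from the question (the variable names are only illustrative):

```python
import re

tag = '<link href="/phoenix/favicon.ico" rel="shortcut icon" type="image/x-icon" />'

# Greedy: .+ extends to the LAST double quote in the tag.
greedy = re.search(r'href="(.+)"', tag).group(1)

# Bounded: [^"]+ stops at the first closing quote of the href value.
bounded = re.search(r'href="([^"]+)"', tag).group(1)

print(greedy)
print(bounded)
```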

Upvotes: 2

lanzz

Reputation: 43178

You cannot, not with a single regular expression. Well, you actually can, but it is really not worth it, and you will end up with an unreadable mess of a regex.

Match against /<link(\s[^>]*rel="(shortcut\s+)?icon"[^>]*)>/i and then match the captured part against /\shref="([^"]+)"/i.
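
A rough Python sketch of this two-step approach follows. The names LINK_RE, HREF_RE, and find_favicon are invented for the example. The first pattern here keeps the leading whitespace inside the captured group and allows zero characters before rel, so that either attribute may come directly after <link; treat that as an assumption of this sketch.

```python
import re

# Step 1: find a <link> tag whose rel is "icon" or "shortcut icon" and
# capture its attribute text (including the leading whitespace).
LINK_RE = re.compile(r'<link(\s[^>]*rel="(?:shortcut\s+)?icon"[^>]*)>', re.IGNORECASE)

# Step 2: pull the href value out of the captured attribute text.
HREF_RE = re.compile(r'\shref="([^"]+)"', re.IGNORECASE)


def find_favicon(html):
    """Return the favicon path found in html, or None if there is none."""
    link = LINK_RE.search(html)
    if link is None:
        return None
    href = HREF_RE.search(link.group(1))
    return href.group(1) if href else None


page = (
    '<link rel="stylesheet" href="style.css" />'
    '<link href="/phoenix/favicon.ico" rel="shortcut icon" type="image/x-icon" />'
)
print(find_favicon(page))
```

Because each regex only does one job, neither needs to care about the order of the attributes.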

Upvotes: 4

gen_Eric

Reputation: 227280

I suggest using PHP's SimpleXML.

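// SimpleXMLElement needs well-formed XML, so the tag must be self-closed as it is here.
$html = '<link href="/phoenix/favicon.ico" rel="shortcut icon" type="image/x-icon" />';
$xml = new SimpleXMLElement($html);
// Attribute order does not matter when reading by name.
echo $xml->attributes()->href;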

Upvotes: 1
