user1285419
Reputation: 2225

How to match this pattern in PHP

I am looking for a regular expression in PHP to parse strings of the following pattern. The commands are wrapped in double square brackets:

[[a src="" desc=""]]

where a, src and desc are fixed keywords. src is required but desc is optional; each value can be wrapped in double or single quotes, and src and desc can appear in either order. For example, the following are all valid:

[[a src="http://a.c.d" desc ="hello"]]
[[a src   ="http://a.c.d" desc= 'hello']]
[[a desc ="hello " src=  'http://a.c.d' ]]
[[a src = "http://a.c.d" ]]
[[a    src="http://a.c.d" desc ="hello"]]

Any whitespace around 'a', 'src', 'desc' and '=' (outside the quoted values) should be ignored. I am going to replace this command with an HTML tag like

<a href="SOMETHING_EXTRACT_FROM_SRC">SOMETHING_EXTRACT_FROM_DESC</a>

It seems pretty tough to come up with one regex to do the job. For now I have three regexes set up to handle the different cases separately:

$pattern = '/\[\[a[:blank:]+src[:blank:]*=[:blank:]*"(.*?)"[:blank:]+desc[:blank:]*=[:blank:]+"(.*?)"\]\]/i';
$rtn = preg_replace($pattern, '<a href="${1}">${2}</a>', $src);

$pattern = '/\[\[a[:blank:]+desc[:blank:]*=[:blank:]*"(.*?)"[:blank:]+src[:blank:]*=[:blank:]+"(.*?)"\]\]/i';
$rtn = preg_replace($pattern, '<a href="${1}">${2}</a>', $rtn);

$pattern = '/\[\[a[:blank:]+src[:blank:]*=[:blank:]+"(.*?)"\]\]/i';
$rtn = preg_replace($pattern, '<a href="${1}">${2}</a>', $rtn);

But this doesn't work. Regular expressions are hard to learn :(

Upvotes: 2

Views: 114

Answers (1)

I wrote a regular expression that matches everything you requested, but it is slightly too permissive, as I'll explain at the end. First, the regex:

\[\[a(\s+(src|desc)\s*=\s*('[^']*'|"[^"]*")){1,2}\s*\]\]

I'll break it down so you can understand it:

  • \[\[ ... \]\] matches [[ ... ]], the beginning and ending
  • \s matches any whitespace character (space, tab, newline), \s+ expects at least one
  • (src|desc) matches either the string src or the string desc. It's an OR operator: match src OR desc.
  • '[^']*' matches two single quotes and anything in between that is not a single quote
  • "[^"]*" same with double quotes
  • ('[^']*'|"[^"]*") matches one of the above two
  • (src|desc)\s*=\s*('[^']*'|"[^"]*") matches a token like src='something'
  • {1,2} matches the preceding group once or twice; applied to the expression above, it matches one or two of those tokens
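
To try the pattern from PHP, here is a minimal sketch that runs it against the sample strings from the question. The ~ delimiter and the variable names are my own choices for illustration; the single quotes inside the pattern are escaped only because the pattern sits in a single-quoted PHP string.

```php
<?php
// The pattern from above, wrapped in "~" delimiters so that "/" in URLs
// would not need escaping. \' is the PHP escape for a literal single quote.
$pattern = '~\[\[a(\s+(src|desc)\s*=\s*(\'[^\']*\'|"[^"]*")){1,2}\s*\]\]~';

// Sample commands copied from the question.
$samples = [
    '[[a src="http://a.c.d" desc ="hello"]]',
    '[[a src   ="http://a.c.d" desc= \'hello\']]',
    '[[a desc ="hello " src=  \'http://a.c.d\' ]]',
    '[[a src = "http://a.c.d" ]]',
    '[[a    src="http://a.c.d" desc ="hello"]]',
];

foreach ($samples as $sample) {
    // preg_match() returns 1 when the pattern matches and 0 when it does not.
    $label = preg_match($pattern, $sample) === 1 ? 'match:    ' : 'no match: ';
    echo $label, $sample, "\n";
}
```

All five samples should be reported as matches. Note that this only tests whether a command is well formed; because the group is repeated, the captured groups hold only the last token, so the values still have to be extracted separately.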

And that's pretty much it. The only problem is that it will also match this:

[[a src="http://a.c.d" src="http://a.c.d"]]

Which I think should not match. If that doesn't bother you, you're good to go; otherwise you'll need to drop the idea of one big group with alternatives (i.e. |) and take a different approach. You could use lookaheads, for example, but it gets nasty pretty fast.
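
One alternative to lookaheads, sketched here and not part of the regex above, is to do the work in two passes: find each command with the permissive pattern, then parse its tokens in a callback and reject duplicates there. The function name replaceLinkCommands is hypothetical, and falling back to the URL as link text when desc is missing is my assumption, since the question doesn't say what should happen in that case.

```php
<?php
// Hypothetical helper: turns every valid [[a ...]] command in $text into an
// <a> tag and leaves anything invalid (duplicate keyword, no src) untouched.
function replaceLinkCommands(string $text): string
{
    // Pass 1: the permissive pattern, with non-capturing groups.
    $command = '~\[\[a(?:\s+(?:src|desc)\s*=\s*(?:\'[^\']*\'|"[^"]*")){1,2}\s*\]\]~';
    // Pass 2: one keyword/value token. (?|...) is a PCRE branch-reset group,
    // so the value lands in group 2 whichever quote style was used.
    $token = '~(src|desc)\s*=\s*(?|\'([^\']*)\'|"([^"]*)")~';

    $result = preg_replace_callback($command, function (array $m) use ($token): string {
        preg_match_all($token, $m[0], $pairs, PREG_SET_ORDER);

        $values = [];
        foreach ($pairs as $pair) {
            if (isset($values[$pair[1]])) {
                return $m[0]; // keyword given twice: keep the original text
            }
            $values[$pair[1]] = $pair[2];
        }
        if (!isset($values['src'])) {
            return $m[0]; // src is mandatory
        }

        $href = htmlspecialchars($values['src'], ENT_QUOTES);
        // Assumption: without desc, the URL doubles as the link text.
        $label = htmlspecialchars($values['desc'] ?? $values['src'], ENT_QUOTES);
        return '<a href="' . $href . '">' . $label . '</a>';
    }, $text);

    return $result ?? $text; // preg_replace_callback() returns null on error
}

echo replaceLinkCommands('See [[a desc ="hello " src=  \'http://a.c.d\' ]] now'), "\n";
```

Keeping the duplicate check in plain PHP leaves both patterns short, at the cost of a second regex pass per command.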

The regex is much more readable if I remove the backslashes and the \s parts. This version won't work, but I think it will help you understand the structure:

[[a ( (src|desc)=('[^']*'|"[^"]*") ){1,2} ]]

Upvotes: 1
