Clement Smith

Reputation: 187

PCRE Regex Syntax

I guess this is more or less a two-part question, but here are the basics first: I am writing some PHP that uses preg_match_all to look in a variable for strings book-ended by {}. It then iterates through each string returned and replaces the strings it found with data from a MySQL query.

The first question is this: Any good sites out there to really learn the ins and outs of PCRE expressions? I've done a lot of searching on Google, but the best one I've been able to find so far is http://www.regular-expressions.info/. In my opinion, the information there is not well-organized and since I'd rather not get hung up having to ask for help whenever I need to write a complex regex, please point me at a couple sites (or a couple books!) that will help me not have to bother you folks in the future.

The second question is this: I have this regex

"/{.*(_){1}(.*(_){1}[a-z]{1}|.*)}/"

and I need it to catch instances such as {first_name}, {last_name}, {email}, etc. I have three problems with this regex.

The first is that it sees "{first_name} {last_name}" as one string, when it should see it as two. I've been able to solve this by checking for the existence of the space, then exploding on the space. Messy, but it works.

The second problem is that it includes punctuation as part of the captured string. So, if you have "{first_name} {last_name},", then it returns the comma as part of the string. I've been able to partially solve this by simply using preg_replace to delete periods, commas, and semi-colons. While it works for those punctuation items, my logic is unable to handle exclamation points, question marks, and everything else.

The third problem I have with this regex is that it is not seeing instances of {email} at all.

Now, if you can, are willing, and have time to simply hand me the solution to this problem, thank you, as that will solve my immediate problem. However, even if you can do this, please please provide an lmgtfy that points to good web sites as references and/or a book or two that would provide a good education on this subject. Sites would be preferable as money is tight, but if a book is the solution, I'll find the money (assuming my local library system is unable to procure said volume).

Upvotes: 2

Views: 5182

Answers (3)

Carl

Reputation: 44488

  1. Here's a good regex site.
  2. Here's a PCRE regex that will work: \{\w+\}

Here's how it works: It's basically looking for { followed by one or more word characters followed by }. The interesting part is that the word character class actually includes an underscore as well. \w is essentially shorthand for [A-Za-z0-9_].

So it will basically match any combination of those characters within braces and, because of the plus sign, will only match braces that are not empty.
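To see the pattern at work, here is a minimal sketch. It uses Python's re module purely for illustration (the question is about PHP's preg_match_all, where the same pattern applies); the sample text and the variable names are made up, and re.ASCII is passed so that \w is limited to the [A-Za-z0-9_] set described above.

```python
import re

# Same pattern as in the answer. re.ASCII restricts \w to [A-Za-z0-9_].
PLACEHOLDER = re.compile(r"\{\w+\}", re.ASCII)

# Made-up sample text covering the cases from the question.
text = "Dear {first_name} {last_name}, we will write to {email}. Ignore {}!"

# Each placeholder is returned separately, without the trailing comma or
# period, and the empty {} is skipped because \w+ needs at least one character.
matches = PLACEHOLDER.findall(text)
print(matches)  # ['{first_name}', '{last_name}', '{email}']
```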

Upvotes: 0

ChrisF

Reputation: 180

For PCRE, I simply digested the PCRE manpages, but then my brain works that way anyway...

As for matching delimited stuff, you generally have 2 approaches:

  1. Match the first delimiter, match anything that is not the closing delimiter, match the closing delimiter.
  2. Match the first delimiter, match anything ungreedily, match the closing delimiter.

E.g. for your case:

  1. \{([^}]+)\}
  2. \{(.+?)\} - Note the ? after the +

I added a group around the content you'd likely want to extract too.

Note also that #1 in particular, and #2 as well if "dot matches anything" is in effect (dotall, singleline or whatever your favourite regex flavour calls it), will also match line breaks inside the braces. If that would be a problem, you'd need to exclude them manually, along with anything else you don't want; see the above answer if you want something more like a whitelist approach.
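The two approaches and the line-break caveat can be compared side by side. This is only a sketch in Python's re module (chosen for illustration; these constructs behave the same way in PCRE), with made-up sample strings.

```python
import re

negated = re.compile(r"\{([^}]+)\}")  # approach 1: anything except the closing brace
lazy = re.compile(r"\{(.+?)\}")       # approach 2: as little as possible

# Made-up sample text. With one capturing group, findall returns the group.
text = "Hi {first_name} {last_name}, mail {email}!"
print(negated.findall(text))  # ['first_name', 'last_name', 'email']
print(lazy.findall(text))     # ['first_name', 'last_name', 'email']

# A line break inside the braces shows where the two differ.
multiline = "{first\nname}"
print(negated.findall(multiline))                      # ['first\nname']
print(lazy.findall(multiline))                         # []
print(re.findall(r"\{(.+?)\}", multiline, re.DOTALL))  # ['first\nname']

# Excluding the line break by hand in the negated class.
print(re.findall(r"\{([^}\n]+)\}", multiline))         # []
```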

Upvotes: 1

Jan Krüger

Reputation: 18530

Back then I found PHP's own PCRE syntax reference quite good: http://uk.php.net/manual/en/reference.pcre.pattern.syntax.php

Let's talk about your expression. It's quite a bit more verbose than necessary; I'm going to simplify it while we go through this.

A rather simpler way of looking at what you're trying to match: "find a {, then any number of letters or underscores, then a }". A regular expression for that is (in PHP's string-y syntax): '/\{[a-z_]+\}/'

This will match all of your examples but also some wilder ones like {__a_b}. If that's not an option, we can go with a somewhat more complex description: "find a {, then a bunch of letters, then (as often as possible) an underscore followed by a bunch of letters, then a }". In a regular expression: /\{[a-z]+(_[a-z]+)*\}/

This second one maybe needs a bit more explanation. Since we want to repeat the thing that matches _foo segments, we need to put it in parentheses. Then we say: try finding this as often as possible, but it's also okay if you don't find it at all (that's the meaning of *).
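A quick way to see the difference between the loose and the strict description is to run both over the same input. The sketch below uses Python's re module for illustration only; the sample string is made up, and the repeated segment is written as a non-capturing group (?:...) so that findall returns whole matches, which is a convenience of this sketch rather than part of the answer.

```python
import re

loose = re.compile(r"\{[a-z_]+\}")               # letters and underscores in any order
strict = re.compile(r"\{[a-z]+(?:_[a-z]+)*\}")   # letters, then zero or more _letters segments

# Made-up sample with well-formed and "wild" placeholders.
samples = "{first_name} {email} {__a_b} {a__b} {name_}"

print(loose.findall(samples))
# ['{first_name}', '{email}', '{__a_b}', '{a__b}', '{name_}']

print(strict.findall(samples))
# ['{first_name}', '{email}']
```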

So now that we have something to compare your attempt to, let's have a look at what caused your problems:

  • Your expression matches any characters inside the {}, including } and { and a whole bunch of other things. In other words, {abcde{_fgh} would be accepted by your regex, as would {abcde} fg_h {ijkl}.
  • You've got a mandatory _ in there, right after the first .*. The (_){1} (which means exactly the same as _) says: whatever happens, explode if this ain't here! Clearly you don't actually want that, because it'll never match {email}.
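Both points can be checked by running the original expression against a few inputs. This is a sketch in Python's re module, used for illustration only (the greedy matching involved works the same way in PCRE); the helper whole_match is a hypothetical name, and the outer braces are escaped, which does not change what the pattern matches.

```python
import re

# The pattern from the question, with only the outer braces escaped.
original = re.compile(r"\{.*(_){1}(.*(_){1}[a-z]{1}|.*)\}")

def whole_match(text):
    """Return the full text of the first match, or None if nothing matches."""
    m = original.search(text)
    return m.group(0) if m else None

# Two placeholders are swallowed as one match, punctuation included.
print(whole_match("{first_name} {last_name}"))  # {first_name} {last_name}
print(whole_match("{first_name}, {email}."))    # {first_name}, {email}

# No underscore, so the mandatory _ can never be satisfied.
print(whole_match("{email}"))                   # None

# Stray braces inside the match are accepted as well.
print(whole_match("{abcde{_fgh}"))              # {abcde{_fgh}
print(whole_match("{abcde} fg_h {ijkl}"))       # {abcde} fg_h {ijkl}
```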

Here's a complete description in plain language of what your regex matches:

  1. Match a {.
  2. Match absolutely anything as long as you can match all the remaining rules right after that anything.
  3. Match a _.
  4. Match absolutely anything again, under the same condition.
  5. Match a _.
  6. Match a single letter.
  7. Instead of steps 4 to 6, absolutely anything is okay, too.
  8. Match a }.

This is probably pretty far from what you wanted. Don't worry, though. Regular expressions take a while to get used to. I think it's very helpful if you think of it in terms of instructions, i.e. when building a regular expression, try to build it in your head as a "find this, then find that", etc. Then figure out the right syntax to achieve exactly that.

This is hard mainly because not all instructions you might come up with in your head easily translate into a piece of a regular expression... but that's where experience comes in. I promise you that you'll have it down in no time at all... if you are fairly methodical about making your regular expressions at first.
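One way to practise the "find this, then find that" habit is to write the expression in extended (verbose) mode, one instruction per line. The sketch below does that in Python's re module with re.VERBOSE (PCRE offers a comparable x modifier) and then uses the pattern for the replacement step described in the question. Everything besides the pattern is an assumption made for illustration: the function fill, the dictionary row standing in for a MySQL result, and the sample data.

```python
import re

# "Find a {, then letters, then any number of _letters segments, then a }."
PLACEHOLDER = re.compile(
    r"""
    \{              # find a {
    (               # start capturing the name
      [a-z]+        #   then a bunch of letters
      (?:_[a-z]+)*  #   then, as often as possible, _ plus more letters
    )               # stop capturing
    \}              # then find a }
    """,
    re.VERBOSE,
)

def fill(template, row):
    """Replace each {name} with row[name]; unknown names are left untouched."""
    def lookup(match):
        name = match.group(1)
        return str(row.get(name, match.group(0)))
    return PLACEHOLDER.sub(lookup, template)

# Hypothetical row, standing in for the result of the MySQL query.
row = {"first_name": "Ada", "last_name": "Lovelace", "email": "ada@example.com"}

print(fill("Dear {first_name} {last_name}, we have {email} on file. {unknown}?", row))
# Dear Ada Lovelace, we have ada@example.com on file. {unknown}?
```

Doing the replacement in a single pass with a callback avoids collecting the matches first and then substituting them one by one; in PHP the analogous function would be preg_replace_callback.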

Good luck! :)

Upvotes: 3
