Reputation: 77

Meaning of this RegEx

I'm not very well versed in regular expressions, so I need some help. I'm using the jQuery DynaCloud plugin, which breaks at a point I've identified in my code when a regex match happens. Can someone help me figure out what this regex matches?

/^[a-z\xE4\xF6\xFC]*[A-Z\xC4\xD6\xDC]([A-Z\xC4\xD6\xDC\xDF]+|[a-z\xE4\xF6\xFC\xDF]{3,}

Any help please!!

Upvotes: 0

Answers (5)

Alan Moore

Reputation: 75222

I'll assume the )/ that's missing from the regex is just a cut-n-paste error on your part; they're present in the DynaCloud source code. What's not present is an end anchor ($), which I find surprising. Here's the relevant code:

var elems = jQuery(this).text()
            .replace(/[^A-Z\xC4\xD6\xDCa-z\xE4\xF6\xFC\xDF0-9_]/g, ' ')
            .replace(jQuery.dynaCloud.stopwords, ' ')
            .split(' ');
var word = 
  /^[a-z\xE4\xF6\xFC]*[A-Z\xC4\xD6\xDC]([A-Z\xC4\xD6\xDC\xDF]+|[a-z\xE4\xF6\xFC\xDF]{3,})/;

The first statement filters out unwanted characters, but leaves digits and underscores alone. The second statement tries to match a word consisting of ASCII letters plus a few non-ASCII letters that are used in (for example) German. However, once it runs out of letters to match, it's free to continue matching any characters, not just those listed in the first regex. Also, any digits or underscores in a word will cause the word to be broken up into two or more words.
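To see what the filtering step hands to the word regex, here is a minimal sketch that runs the same replace/split on a plain string. The sample sentence is made up, and the stopword step is left out because it depends on jQuery.dynaCloud.stopwords:

```javascript
// Same character filter as in the DynaCloud snippet, applied to a plain string.
// The sample text is hypothetical; the stopword replacement is omitted.
const text = 'Die Größe von Version_2 beträgt 40%!';

const elems = text
  .replace(/[^A-Z\xC4\xD6\xDCa-z\xE4\xF6\xFC\xDF0-9_]/g, ' ') // everything else becomes a space
  .split(' ');

console.log(elems);
```

The umlauts, the ß, the digits and the underscore all survive the filter; only the punctuation is turned into spaces, which is also why empty strings show up at the end of the array.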

I would try anchoring the regex at the end and adding support for digits and underscores, like this:

/^[a-z\xE4\xF6\xFC]*[A-Z\xC4\xD6\xDC]([A-Z\xC4\xD6\xDC\xDF0-9_]+|[a-z\xE4\xF6\xFC\xDF0-9_]{3,})$/g

This regex is just for illustration purposes; it's not intended to be a solution. For one thing, I made a wild guess on the positions of the digits and underscores. For another thing, it can now match words that end with digits and underscores, and you might not want that.
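A rough way to see what the missing end anchor changes is to run the original pattern next to a copy that only has $ appended. The sample strings and the anchored variant are my own assumptions for illustration, not DynaCloud code:

```javascript
// Original DynaCloud pattern: anchored at the start only.
const word = /^[a-z\xE4\xF6\xFC]*[A-Z\xC4\xD6\xDC]([A-Z\xC4\xD6\xDC\xDF]+|[a-z\xE4\xF6\xFC\xDF]{3,})/;
// Hypothetical variant with an end anchor added, for comparison only.
const anchored = /^[a-z\xE4\xF6\xFC]*[A-Z\xC4\xD6\xDC]([A-Z\xC4\xD6\xDC\xDF]+|[a-z\xE4\xF6\xFC\xDF]{3,})$/;

// Made-up sample tokens.
const samples = ['Hello', 'Hello_42', 'iPhone', 'USA', 'hello', 'Hi'];

const results = samples.map((s) => ({
  sample: s,
  unanchored: word.test(s), // true if a matching prefix exists
  anchored: anchored.test(s), // true only if the whole token matches
}));

console.log(results);
```

Only 'Hello_42' is treated differently: the unanchored pattern accepts it because the prefix 'Hello' matches, while the anchored one rejects it. 'hello' and 'Hi' fail both, since the pattern needs one capital letter followed by either more capitals or at least three lowercase letters.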

Upvotes: 0

SUB0DH

Reputation: 5240

Assuming this is your regex:

/^[a-z\xE4\xF6\xFC]*[A-Z\xC4\xD6\xDC]([A-Z\xC4\xD6\xDC\xDF]+|[a-z\xE4\xF6\xFC\xDF]{3,})/

The following would be an explanation for the regex:

"^" +                              // Assert position at the beginning of the string (no m flag is set, so positions after line breaks do not count)
"[a-z\xE4\xF6\xFC]" +              // Match a single character present in the list below
                                      // A character in the range between “a” and “z”
                                      // Latin-1 character 0xE4 (228 decimal, “ä”)
                                      // Latin-1 character 0xF6 (246 decimal, “ö”)
                                      // Latin-1 character 0xFC (252 decimal, “ü”)
   "*" +                              // Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
"[A-Z\xC4\xD6\xDC]" +              // Match a single character present in the list below
                                      // A character in the range between “A” and “Z”
                                      // Latin-1 character 0xC4 (196 decimal, “Ä”)
                                      // Latin-1 character 0xD6 (214 decimal, “Ö”)
                                      // Latin-1 character 0xDC (220 decimal, “Ü”)
"(" +                              // Match the regular expression below and capture its match into backreference number 1
                                      // Match either the regular expression below (attempting the next alternative only if this one fails)
      "[A-Z\xC4\xD6\xDC\xDF]" +          // Match a single character present in the list below
                                            // A character in the range between “A” and “Z”
                                            // Latin-1 character 0xC4 (196 decimal, “Ä”)
                                            // Latin-1 character 0xD6 (214 decimal, “Ö”)
                                            // Latin-1 character 0xDC (220 decimal, “Ü”)
                                            // Latin-1 character 0xDF (223 decimal, “ß”)
         "+" +                              // Between one and unlimited times, as many times as possible, giving back as needed (greedy)
   "|" +                              // Or match regular expression number 2 below (the entire group fails if this one fails to match)
      "[a-z\xE4\xF6\xFC\xDF]" +          // Match a single character present in the list below
                                            // A character in the range between “a” and “z”
                                            // Latin-1 character 0xE4 (228 decimal, “ä”)
                                            // Latin-1 character 0xF6 (246 decimal, “ö”)
                                            // Latin-1 character 0xFC (252 decimal, “ü”)
                                            // Latin-1 character 0xDF (223 decimal, “ß”)
         "{3,}" +                           // Between 3 and unlimited times, as many times as possible, giving back as needed (greedy)
")"

Upvotes: 0

red-X

Reputation: 5128

The \xNN parts are hexadecimal escapes for special characters; if you replace them with the characters themselves, you basically get:

/^[a-zäöü]*[A-ZÄÖÜ]([A-ZÄÖÜß]+|[a-zäöüß]{3,})/

I'll take it apart for you:

^ beginning of string

[a-zäöü] character set: any character from a to z, or ä, ö, ü

* zero or more times

[A-ZÄÖÜ] character set: any character from A to Z, or Ä, Ö, Ü, exactly once

( start of group

[A-ZÄÖÜß] another character set, you should get it by now :)

+ one or more times

| or

[a-zäöüß] character set

{3,} three or more times

) end of group

Also, you missed a )/ at the end. The / at the start and at the end mean that what's in between is the regex.
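If you want to convince yourself that the escaped and the literal versions are the same pattern, a quick check along these lines should do (the test words are just examples I picked):

```javascript
// Pattern as written in the plugin, with \xNN escapes.
const escaped = /^[a-z\xE4\xF6\xFC]*[A-Z\xC4\xD6\xDC]([A-Z\xC4\xD6\xDC\xDF]+|[a-z\xE4\xF6\xFC\xDF]{3,})/;
// Same pattern with the characters written out.
const literal = /^[a-zäöü]*[A-ZÄÖÜ]([A-ZÄÖÜß]+|[a-zäöüß]{3,})/;

// The seven escapes are exactly these seven characters.
const sameChars = '\xE4\xF6\xFC\xC4\xD6\xDC\xDF' === 'äöüÄÖÜß';

// Example words chosen for illustration.
const words = ['Übung', 'Straße', 'ÖBB', 'über', 'Ab'];
const verdicts = words.map((w) => literal.test(w));
const agree = words.every((w) => escaped.test(w) === literal.test(w));

// The parentheses capture whatever follows the capital letter.
const group = literal.exec('iPhone')[1];

console.log(sameChars, agree, verdicts, group);
```

'Übung', 'Straße' and 'ÖBB' match; 'über' has no capital letter and 'Ab' has too few letters after the capital, so both fail. For 'iPhone' the group holds 'hone'.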

Upvotes: 1

N4553R

Reputation: 188

^ beginning of the string (or of a line, in multiline mode)

[...] a class of possible chars

a-z range (abcde...yz)

\xE4 a char given by its hexadecimal code (0xE4 is ä; strictly speaking a Latin-1 code, not ASCII).

{n,m} between n and m occurrences.

* equivalent to {0,}

+ equivalent to {1,}
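A small sketch to check these equivalences yourself; the toy patterns on the letter a and the inputs are made up for the purpose:

```javascript
// Toy inputs for comparing the quantifier spellings.
const inputs = ['', 'a', 'aaa', 'b'];

// * behaves like {0,} and + behaves like {1,} on every input.
const sameStar = inputs.every((s) => /^a*$/.test(s) === /^a{0,}$/.test(s));
const samePlus = inputs.every((s) => /^a+$/.test(s) === /^a{1,}$/.test(s));

// {n,m}: between n and m occurrences, here 2 to 3.
const bounded = ['a', 'aa', 'aaa', 'aaaa'].map((s) => /^a{2,3}$/.test(s));

// \xE4 is the character with code 0xE4 (228 decimal), i.e. ä.
const code = '\xE4'.charCodeAt(0);

console.log(sameStar, samePlus, bounded, code);
```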

Upvotes: 1

Dominik

Reputation: 3362

I'd suggest you take a look at Expresso. Given that you missed the closing parenthesis, this is the result:

[Image: Expresso's analysis of the regex; screenshot not available]

Upvotes: 1
