Sarabjot

Reputation: 509

Regular expression that validates a CSS selector

What regular expression can be used to validate a CSS selector, in such a way that an invalid selector is rejected quickly?

Valid selectors:

EE
#myid
.class
.class.anotherclass
EE .class
EE .class EEE.anotherclass
EE[class="test"]
.class[alt~="test"]
#myid[alt="test"]
EE:hover
EE:first-child
E[lang|="en"]:first-child
EE#test .class>.anotherclass
EE#myid.classshit.anotherclass[class~="test"]:hover
EE#myid.classshit.anotherclass[class="test"]:first-child EE.Xx:hover

Invalid selectors, e.g. ones that contain extra whitespace, such as at the end of the line:

EE:hover   EE
EE .class EEE.anotherclass 
EE#myid.classshit.anotherclass[class="test"]:first-child EE.Xx:hov     9
EE#myid.classshit.anotherclass[class="test"]:first-child EE.Xx:hov  -daf

Upvotes: 0

Views: 1598

Answers (3)

fuxia

Reputation: 63566

Regular expressions are the wrong tool. CSS selectors are way too complex. Example:

bo\
dy:not(.\}) {}

Use a parser with a real tokenizer, like this one: PHP-CSS-Parser. It is easier to port it to Java than to get a regex right.
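To illustrate the tokenizer approach, here is a minimal hand-written scanner in Java. It is only a sketch: the class name SelectorScanner and its simplified grammar (identifiers, #id, .class, [attr="value"], :pseudo, and exactly one space, >, + or ~ between compound selectors) are assumptions chosen to cover the examples in the question. It is not the real CSS grammar and not how PHP-CSS-Parser works. It does not handle escapes or :not(...), so the example above would still be rejected. It reads the input once, left to right, and gives up at the first unexpected character, so invalid input fails quickly.

```java
final class SelectorScanner {
    private final String s;
    private int pos = 0;

    private SelectorScanner(String s) {
        this.s = s;
    }

    /** Returns true if the whole input matches the simplified grammar. */
    static boolean isValid(String input) {
        SelectorScanner sc = new SelectorScanner(input);
        if (!sc.compound()) {
            return false;
        }
        while (sc.pos < input.length()) {
            // Exactly one combinator character, then another compound selector.
            if (!sc.combinator() || !sc.compound()) {
                return false;
            }
        }
        return true;
    }

    /** Current character, or '\0' once the input is exhausted. */
    private char peek() {
        return pos < s.length() ? s.charAt(pos) : '\0';
    }

    private boolean combinator() {
        char c = peek();
        if (c == ' ' || c == '>' || c == '+' || c == '~') {
            pos++;
            return true;
        }
        return false;
    }

    /** Optional element name or '*', then any mix of #id, .class, :pseudo, [attr]. */
    private boolean compound() {
        int start = pos;
        if (peek() == '*') {
            pos++;
        } else {
            ident(); // the element name is optional, so the result is ignored
        }
        while (true) {
            char c = peek();
            if (c == '#' || c == '.' || c == ':') {
                pos++;
                if (!ident()) {
                    return false;
                }
            } else if (c == '[') {
                if (!attribute()) {
                    return false;
                }
            } else {
                break;
            }
        }
        return pos > start; // an empty compound selector is invalid
    }

    /** [name], [name="v"], [name~="v"] or [name|="v"]. */
    private boolean attribute() {
        pos++; // skip '['
        if (!ident()) {
            return false;
        }
        boolean prefixed = peek() == '~' || peek() == '|';
        if (prefixed) {
            pos++;
        }
        if (peek() == '=') {
            pos++;
            if (peek() != '"') {
                return false;
            }
            int close = s.indexOf('"', pos + 1);
            if (close < 0) {
                return false;
            }
            pos = close + 1;
        } else if (prefixed) {
            return false; // '~' or '|' must be followed by '='
        }
        if (peek() != ']') {
            return false;
        }
        pos++;
        return true;
    }

    /** Simplified identifier: a letter or '_', then letters, digits, '_' or '-'. */
    private boolean ident() {
        int start = pos;
        if (Character.isLetter(peek()) || peek() == '_') {
            pos++;
            while (Character.isLetterOrDigit(peek()) || peek() == '_' || peek() == '-') {
                pos++;
            }
        }
        return pos > start;
    }

    public static void main(String[] args) {
        System.out.println(isValid("EE#test .class>.anotherclass")); // true
        System.out.println(isValid("EE:hover   EE"));                // false
    }
}
```

Each grammar rule is one small method, so extending the sketch (for example with escapes or functional pseudo-classes) means adding a branch rather than rewriting one large pattern.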

Upvotes: 4

André Banderas

Reputation: 11

Here is a regex that I use in my code:

[+>~, ]?\s*(\w*[#.]\w+|\w+|\*)+(:[\w\-]+\([\w\s\-\+]*\))*(\[[\w ]+=?[^\]]*\])*([#.]\w+)*(:[\w\-]+\([\w\s\-\+]*\))*

After tokenizing, I use the trim function to remove extra spaces, e.g.:

expression:

EE.class      EE#id.class

tokens:

EE.class

   EE#id.class

tokens after trim:

EE.class

EE#id.class

Another example:

>EE.class (a leading > marks a direct child, which I then handle with ordinary substring code)

Other routines can then check, for example, whether a token is a number.

You can use http://regexpal.com/ for tests.
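The tokenize-then-trim steps described above could look roughly like this in Java. This is a sketch, not the answer's actual code: the class name SelectorTokens and the tokenize helper are made up for illustration, and only the pattern itself comes from the answer. Note that the pattern as written expects parentheses after a pseudo-class name, so a plain :hover is not consumed as part of a token.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SelectorTokens {
    // The pattern from the answer, with backslashes doubled for a Java string literal.
    static final Pattern TOKEN = Pattern.compile(
        "[+>~, ]?\\s*(\\w*[#.]\\w+|\\w+|\\*)+"
        + "(:[\\w\\-]+\\([\\w\\s\\-\\+]*\\))*"
        + "(\\[[\\w ]+=?[^\\]]*\\])*"
        + "([#.]\\w+)*"
        + "(:[\\w\\-]+\\([\\w\\s\\-\\+]*\\))*");

    /** Finds successive matches and trims the surrounding whitespace off each one. */
    static List<String> tokenize(String selector) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(selector);
        while (m.find()) {
            tokens.add(m.group().trim());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("EE.class      EE#id.class")); // [EE.class, EE#id.class]
        System.out.println(tokenize(">EE.class"));                 // [>EE.class]
    }
}
```

Because find() skips over anything it cannot match, this only splits the input into tokens. On its own it does not reject an invalid selector, so any validation has to happen in the follow-up routines.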

Upvotes: 1

Tony Ennis

Reputation: 12299

The problem with your typical regular expression is that it cannot handle arbitrary levels of nesting. It has no memory. Consider a string of some number of a's followed by the same number of b's, such as aaabbb, and a reasonable regexp a*b*. When the regexp gets to the first 'b', it has no memory of how many a's it recognized, and therefore it cannot require the same number of b's.

Now replace a and b with ( and ), IF and END, <x> and </x>, etc., and you can see the problem.
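A small Java sketch of the point, with made-up names (CountingDemo, LOOSE, balanced): the pattern a*b* accepts unequal counts, while a few lines of code with explicit counters can enforce the match.

```java
import java.util.regex.Pattern;

final class CountingDemo {
    // Accepts any run of a's followed by any run of b's; the counts are not compared.
    static final Pattern LOOSE = Pattern.compile("a*b*");

    /** True only for n a's followed by exactly n b's. The counters are the "memory". */
    static boolean balanced(String s) {
        int i = 0;
        int as = 0;
        while (i < s.length() && s.charAt(i) == 'a') {
            as++;
            i++;
        }
        int bs = 0;
        while (i < s.length() && s.charAt(i) == 'b') {
            bs++;
            i++;
        }
        return i == s.length() && as == bs;
    }

    public static void main(String[] args) {
        System.out.println(LOOSE.matcher("aaabb").matches()); // true
        System.out.println(balanced("aaabb"));                // false
        System.out.println(balanced("aaabbb"));               // true
    }
}
```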

Upvotes: 0
