Mike Rockétt
Mike Rockétt

Reputation: 9007

Regular Expression: Delimited Extensions

I'm trying to formulate the shortest possible Regular Expression for checking if a parameter token (in a string) contains a certain file extension.

The extensions I'd like to check for are:

asp,aspx,cfm,cgi,fcgi,dll,htm,html,shtm,shtml,jhtml,phtml,xhtm,rbml,jsp,php,phps,php4

At present, I have the following expression in place:

aspx?|cfm|f?cgi|dll|s?html?|jhtml|phtml|xhtm|rbml|jsp|phps?|php4

I'm sure there's a shorter way to do it, but I'm not a RegEx junkie, and so I'm not the best at this.

Upvotes: 0

Views: 69

Answers (2)

Casimir et Hippolyte
Casimir et Hippolyte

Reputation: 89557

To build an efficient pattern that begins with an alternation, you only need to take in account the first character of each alternatives. The reason is that once the first character matches, you don't need to test an other alternative. Here, if I count the number of occurence for each first character in the list, I obtain:

p:4
a,c,h,j,s:2
d,f,r,x:1

So, the pattern will look like this:

(?:p...|a...|c...|h...|j...|s...|d...|f...|r...|x...)

Now, I only need to fill each member of the alternation:

(?:ph(?:p[4s]?|tml)|aspx?|c(?:fm|gi)|html?|j(?:html|sp)|shtml?|dll|fcgi|rbml|xhtm)

But an alternation has a cost in particular at the begining of a pattern since each member of the alternation must be tested for each characters in the string, including characters that are not one of the first character in the alternation. To avoid the problem, you can use the first character discrimination technique to reduce the tests to the concerned characters.

(?=[acdfhjprsx])(?:ph(?:p[4s]?|tml)|aspx?|c(?:fm|gi)|html?|j(?:html|sp)|shtml?|dll|fcgi|rbml|xhtm)

Note: I have choosen here to sort the first characters from the more frequent to the less frequent. But if in practice, you note that for example "dll" is the most frequent you can change the position of the "d" alternative.

Note2: don't believe that a short pattern is an efficient pattern.

Upvotes: 1

Barmar
Barmar

Reputation: 780974

You can combine some of them:

aspx?|cfm|f?cgi|dll|s?html?|[jp]html|xhtm|rbml|jsp|php[s4]?

However, in my opinion your original regexp is fine. Shorter isn't necessarily better. Listing all the cases separately makes it clearer what you're doing. Merging lots of cases makes it hard to understand.

Upvotes: 3

Related Questions