Wilfred Hughes

Reputation: 31171

Does order matter in extended regular expressions with []?

I'm trying to understand the [] syntax with extended regular expressions in grep.

The following two patterns are equivalent:

$ echo "foo_bar" | grep -E "[a-z_]+$"     
foo_bar
$ echo "foo_bar" | grep -E "[_a-z]+$" 
foo_bar

However, these two are not:

$ echo "foobar[]" | grep -E "[a-z_\[\]]+$" 
foobar[]
$ echo "foobar[]" | grep -E "[a-z\[\]_]+$"

Why is this? Is this documented anywhere? I couldn't see anything in man grep about this.

Upvotes: 3

Views: 126

Answers (1)

Micha Wiedenmann

Reputation: 20843

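You should be careful when combining double quotes " and backslashes \, since Bash processes some backslash sequences inside double quotes before grep sees the pattern. Here \[ and \] happen to pass through unchanged (inside double quotes Bash only treats a backslash specially before $, `, ", \ or a newline), so grep still receives [a-z_\[\]]+$. Single quotes avoid the question entirely, and for the remainder of this answer I assume that you had used single quotes.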

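In the first case you have the character class [a-z_\[\], which matches the characters a-z, _, \ and [. Inside a bracket expression the backslash is an ordinary character, so the final \] does not add ] to the class: the \ is just another \ and the ] is the closing bracket of the class. The rest of the pattern, ]+$, then matches one or more literal ] at the end of the line. In the second case the class is [a-z\[\] and the rest of the pattern is _]+$, which requires a literal _ before the trailing ], so foobar[] does not match. Notice how: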

$ echo "foobar[]" | grep -E '[a-z\[\]+\]+$'
foobar[]
$ echo '\' | grep -E '[\]$'
\

If you want to include ] in the class, you have to list it first: []] matches a single ].

$ echo "]" | grep -E '[]]$'
]
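Applied to the pattern from the question, a minimal sketch of a version that accepts both brackets might look like the following. The variable names pattern and result are only for illustration; the [ needs no special position, only the ] has to come first.

```shell
# ] is listed first so it is taken literally; [ is an ordinary member of the class.
pattern='[][a-z_]+$'
result=$(echo 'foobar[]' | grep -E "$pattern")
echo "$result"
```

With this pattern the position of _ and a-z inside the class no longer matters, because nothing closes the class early.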

For a reference see man grep:

To include a literal ] place it first in the list. Similarly, to include a literal ^ place it anywhere but first. Finally, to include a literal - place it last.

as well as https://www.regular-expressions.info/charclass.html

In most regex flavors, the only special characters or metacharacters inside a character class are the closing bracket ], the backslash \, the caret ^, and the hyphen -. The usual metacharacters are normal characters inside a character class, and do not need to be escaped by a backslash. To search for a star or plus, use [+*]. Your regex will work fine if you escape the regular metacharacters inside a character class, but doing so significantly reduces readability.
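The placement rules quoted above can be tried together in one class. This is only a sketch: the variable name matched is made up, and the class []^+*-] lists ] first, ^ away from the start, + and * unescaped, and - last. Note that in the POSIX bracket expressions grep uses, the backslash is not special either, so it is deliberately left out of the class here.

```shell
# Feed one candidate character per line; only the five listed in the class should match.
matched=$(printf '%s\n' ']' '^' '+' '*' '-' 'x' '\' | grep -E '^[]^+*-]$' | tr -d '\n')
echo "$matched"
```

The lines x and \ are filtered out, since neither character is a member of the class.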

Even more test cases to examine [\s] (which is the same as [s\] and different from [[:space:]]):

$ echo 'a ' | grep -E 'a[\s]$'
$ echo 's' | grep -E '[\s]$'
s
$ echo '\' | grep -E '[\s]$'
\
$ echo 'a ' | grep -E 'a[[:space:]]$'
a

So the takeaway is: Order does not matter when listing characters of a character class, except when it does.

Upvotes: 2
