Sandeep Samal

Reputation: 143

Peculiar results with "[ ]" in egrep on Linux, even when the "\" escape character is used

I recently came across the following situation while doing some homework with regular expressions.

s@ubuntu:~$ echo b | egrep []b]
b
s@ubuntu:~$ echo b | egrep [[b]
b
s@ubuntu:~$ echo b | egrep []b[]
b
s@ubuntu:~$ echo b | egrep [b[]
b
s@ubuntu:~$ echo b | egrep [[b]]
s@ubuntu:~$ echo b | egrep [b]]
s@ubuntu:~$ echo b | egrep [b\]]
s@ubuntu:~$ echo b | egrep [b\\]]
s@ubuntu:~$ echo b | egrep [\[b\]]

Why am I not getting 'b' printed in the last 5 cases?

Upvotes: 3

Views: 45

Answers (2)

Giuseppe Ricupero

Reputation: 6272

The reason lies in the special rules that apply inside bracket expressions:

The right square bracket ] has to be placed immediately after the opening [ or [^ to be treated as a literal.

and

The escape character \ is treated literally inside a character class [...]

In addition, the shell processes the escape character \ before passing the expression to egrep, because the regex is not enclosed in single quotes '...' or double quotes "...".

Jonathan Leffler explains it well with examples; I can only add a link to the POSIX rules for bracket expressions as an overview:
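
To see what the shell actually hands to egrep, one option is to print the same word with and without quotes. This is only a sketch, assuming a POSIX shell; the variable names are made up for illustration, and set -f is there so the unquoted words cannot be expanded as filename patterns.

```shell
# Disable filename expansion so the unquoted words are not treated as globs.
set -f

# Unquoted: the shell removes the backslashes before the command sees the word.
unquoted=$(printf '%s\n' [b\]])
escaped=$(printf '%s\n' [\[b\]])

# Single-quoted: the word is passed through unchanged.
quoted=$(printf '%s\n' '[b\]]')

set +f

echo "unquoted [b\\]]   arrives as: $unquoted"
echo "unquoted [\\[b\\]] arrives as: $escaped"
echo "quoted   '[b\\]]' arrives as: $quoted"
```

The first two values should come out as [b]] and [[b]], which are the patterns egrep really receives in the question.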

http://pubs.opengroup.org/onlinepubs/007904875/basedefs/xbd_chap09.html#tag_09_03_05

UPDATE

The same expressions with quotes:

# this matches 'b]' or '\]'
~$ echo b] | egrep '[b\]]'
b]
~$ echo '\]' | egrep '[b\]]' # note the quotes before and after the pipe
\]

# the next one is equivalent to '[b\]]'
# because a doubled \ inside a character class is redundant
~$ echo b] | egrep '[b\\]]'
b]
~$ echo '\]' | egrep '[b\\]]'
\]

# the last one matches '\]' or '[]' or 'b]'
~$ echo b] | egrep '[\[b\]]'
b]
~$ echo [] | egrep '[\[b\]]'
[]
~$ echo '\]' | egrep '[\[b\]]'
\]
# without quotes in the echo part, the escape \ is consumed by the shell,
# so egrep receives only a closing bracket ']' as input and nothing is printed
~$ echo \] | egrep '[\[b\]]'

# If we instead remove the quotes from the egrep pattern,
# the regex becomes equivalent to [[b]], so it matches '[]' or 'b]' but no longer '\]'
~$ echo '\]' | egrep [\[b\]]
~$ echo '[]' | egrep [\[b\]]
[]
~$ echo 'b]' | egrep [\[b\]]
b]

Upvotes: 3

Jonathan Leffler

Reputation: 754480

  • egrep [[b]]: looks for a b or [ followed by a ]; not found.
  • egrep [b]]: looks for a b followed by a ]; not found.
  • egrep [b\]]: looks for a b followed by a ]; not found. The backslash is removed by the shell and never seen by egrep.
  • egrep [b\\]]: looks for a b or a backslash followed by a ]; not found. The shell reduces the two backslashes to one.
  • egrep [\[b\]]: looks for a b or a [ followed by a ]; not found. The backslashes are removed by the shell and never seen by egrep.

Inside a character class (started by [), the first ] terminates the class unless the ] is the first character after the [, or the first character after the [^ for a negated character class. Note that ] is not a regex metacharacter unless there is a preceding [ making it into the end of a character class. You also find that $ is not a metacharacter in the middle of a string, nor ^ unless it appears at the start, nor * nor + nor ? if they appear first, etc. See POSIX Regular Expressions for a detailed discussion — the regular expressions handled by egrep (now grep -E) are 'extended regular expressions'.
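
The rule about where ] sits in the class can be checked directly. A small sketch, assuming a grep that supports -E (the variable names are only for illustration); the patterns are single-quoted so the shell leaves them alone:

```shell
# ']' placed first in the class is a literal member, so a lone 'b' matches.
first=$(echo b | grep -E '[]b]')

# Here the class is just [b]; the second ']' is a literal that must follow it.
# A lone 'b' therefore does not match ('|| true' keeps a non-match from
# aborting scripts that run under 'set -e').
lone=$(echo b | grep -E '[b]]' || true)

# The same pattern does match once the input really contains 'b]'.
pair=$(echo 'b]' | grep -E '[b]]')

printf 'first=%s lone=%s pair=%s\n' "$first" "$lone" "$pair"
```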

The shell messes around with backslashes before egrep gets a chance to see them. You should enclose your regex in single quotes to avoid the shell altering what egrep sees.

You can demonstrate my analysis by changing what is echoed:

echo '[b]' | egrep [[b]]
echo '[b]' | egrep [b]]
echo '[b]' | egrep [b\]]
echo '[b]' | egrep [b\\]]
echo '[b]' | egrep [\[b\]]

The output from that is:

[b]
[b]
[b]
[b]
[b]

The [ in these examples (in the echoed data) is present for cosmetic reasons; it could be omitted and the lines would be accepted.
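
As a quick check of that last point, the loop below feeds b] (without the leading [) to each pattern, written in single quotes in the form egrep receives after the shell has removed the backslashes. It is a sketch assuming a grep that supports -E and -q; the names re and count are illustrative.

```shell
# Patterns as egrep sees them for the five failing commands in the question.
count=0
for re in '[[b]]' '[b]]' '[b]]' '[b\]]' '[[b]]'; do
    # -q suppresses output; only the exit status is used.
    if echo 'b]' | grep -qE "$re"; then
        count=$((count + 1))
    fi
done
echo "$count of 5 patterns accept b]"
```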

Upvotes: 5
