Reputation: 26828

Why is "grep --ignore-case" 50 times slower?

I was very surprised to see that when you add the --ignore-case option to grep that it can slow down the search by 50x times. I've tested this on two different machines with the same result. I am curious to find out an explanation for the huge performance difference.

I would also like to see an alternative command to grep for case-insensitive searches. I don't need regular expressions, just fixed string searching. First the test file will be a 50 MB plain text file with some dummy data, you may use the following code to generate it:

Create test.txt

yes all work and no play makes Jack a dull boy | head -c 50M > test.txt
echo "Jack is no fun" >> test.txt
echo "Jack is no Fun" >> test.txt

Demonstration

Below is a demonstration of the slowness. By adding the --ignore-case option the command becomes 57x times slower.

$ time grep fun test.txt
all work and no plJack is no fun
real    0m0.061s

$ time grep --ignore-case fun test.txt
all work and no plJack is no fun
Jack is no Fun
real    0m3.498s

Possible Explanations

Googling around I found an a discussion on grep being slow in the UTF-8 locale. So I ran the following test, and it did speed up. The default locale on my machine is en_US.UTF-8, so setting it to POSIX seems to have made a performance boot, but now of course I can't search correctly on Unicode text which is undesirable. It is also still 2.5 times slower.

$ time LANG=POSIX grep --ignore-case fun test.txt
all work and no plJack is no fun
Jack is no Fun
real    0m0.142s

Alternatives

We could use Perl instead it is faster, but still 5.5 times faster then the case sensitive grep. And the POSIX grep above is about twice as fast.

$ time perl -ne '/fun/i && print' test.txt
all work and no plJack is no fun
Jack is no Fun
real    0m0.388s

So I'd love to find a fast correct alternative and an explanation if anyone has one.

UPDATE - CentOS

The two machines that were tested above both were running Ubuntu one 11.04 (Natty Narwhal), the other 12.04 (Precise Pangolin). Running the same tests on a CentOS 5.3 machine produces the following interesting results. The performance results of the two cases are almost identical. Now CentOS 5.3 was released in Jan 2009 an is running grep 2.5.1 while Ubuntu 12.04 is running grep 2.10. So there might be changes in the new version and differences in the two distributions.

$ time grep fun test.txt
Jack is no fun
real    0m0.026s

$ time grep --ignore-case fun test.txt
Jack is no fun
Jack is no Fun
real    0m0.027s

Upvotes: 16

Answers (4)

Pedro Gimeno

Reputation: 3125

The reason is that it needs to do a Unicode-aware comparison for the current locale, and judging by Marat's answer, it's not very efficient in doing so.

This shows how much faster it is when Unicode is not taken into consideration:

$ time LC_CTYPE=C grep -i fun test.txt
all work and no plJack is no fun
Jack is no Fun
real    0m0.192s

Of course, this alternative won't work with characters in other languages such as Ñ/ñ, Ø/ø, Ð/ð, Æ/æ and so on.

Another alternative is to modify the regex so that it matches with case insensitivity:

$ time grep '[Ff][Uu][Nn]' test.txt
all work and no plJack is no fun
Jack is no Fun
real    0m0.193s

This is reasonably fast, but of course it's a pain to convert each character into a class, and it's not easy to convert it to an alias or an sh script, unlike the above one.

For comparison, in my system:

$ time grep fun test.txt
all work and no plJack is no fun
real    0m0.085s

$ time grep -i fun test.txt
all work and no plJack is no fun
Jack is no Fun
real    0m3.810s

Upvotes: 1

Marat Bukharov

Reputation: 351

This slowness is due to grep (on a UTF-8 locale) constantly accesses files "/usr/lib/locale/locale-archive" and "/usr/lib/gconv/gconv-modules.cache".

It can be shown using the strace utility. Both files are from glibc.

Upvotes: 8

ams

Reputation: 25599

To do a case-insensitive search, grep first has to convert your entire 50 MB file to one case or the other. That's going to take time. Not only that, but there are memory copies...

In your test case, you first generate the file. This means that it will be memory cached. The first grep run only has to mmap the cached pages; it doesn't even have to access the disk.

The case-insensitive grep does the same, but then it tries to modify that data. This means the kernel will take an exception for each modified 4 kB page, and will end up having to copy the entire 50 MB into new memory, one page at a time.

Basically, I'd expect this to be slower. Maybe not 57x slower, but definitely slower.

Upvotes: 0

Why is &quot;grep --ignore-case&quot; 50 times slower?

Answers (4)

Related Questions

Why is "grep --ignore-case" 50 times slower?