Reputation: 26828
I was very surprised to see that when you add the --ignore-case
option to grep
that it can slow down the search by 50x times. I've tested this on two different machines with the same result. I am curious to find out an explanation for the huge performance difference.
I would also like to see an alternative command to grep for case-insensitive searches. I don't need regular expressions, just fixed string searching. First the test file will be a 50 MB plain text file with some dummy data, you may use the following code to generate it:
Create test.txt
yes all work and no play makes Jack a dull boy | head -c 50M > test.txt
echo "Jack is no fun" >> test.txt
echo "Jack is no Fun" >> test.txt
Demonstration
Below is a demonstration of the slowness. By adding the --ignore-case
option the command becomes 57x times slower.
$ time grep fun test.txt
all work and no plJack is no fun
real 0m0.061s
$ time grep --ignore-case fun test.txt
all work and no plJack is no fun
Jack is no Fun
real 0m3.498s
Possible Explanations
Googling around I found an a discussion on grep being slow in the UTF-8 locale. So I ran the following test, and it did speed up. The default locale on my machine is en_US.UTF-8
, so setting it to POSIX
seems to have made a performance boot, but now of course I can't search correctly on Unicode text which is undesirable. It is also still 2.5 times slower.
$ time LANG=POSIX grep --ignore-case fun test.txt
all work and no plJack is no fun
Jack is no Fun
real 0m0.142s
Alternatives
We could use Perl instead it is faster, but still 5.5 times faster then the case sensitive grep. And the POSIX grep above is about twice as fast.
$ time perl -ne '/fun/i && print' test.txt
all work and no plJack is no fun
Jack is no Fun
real 0m0.388s
So I'd love to find a fast correct alternative and an explanation if anyone has one.
UPDATE - CentOS
The two machines that were tested above both were running Ubuntu one 11.04 (Natty Narwhal), the other 12.04 (Precise Pangolin). Running the same tests on a CentOS 5.3 machine produces the following interesting results. The performance results of the two cases are almost identical. Now CentOS 5.3 was released in Jan 2009 an is running grep 2.5.1 while Ubuntu 12.04 is running grep 2.10. So there might be changes in the new version and differences in the two distributions.
$ time grep fun test.txt
Jack is no fun
real 0m0.026s
$ time grep --ignore-case fun test.txt
Jack is no fun
Jack is no Fun
real 0m0.027s
Upvotes: 16
Views: 3967
Reputation: 3125
The reason is that it needs to do a Unicode-aware comparison for the current locale, and judging by Marat's answer, it's not very efficient in doing so.
This shows how much faster it is when Unicode is not taken into consideration:
$ time LC_CTYPE=C grep -i fun test.txt
all work and no plJack is no fun
Jack is no Fun
real 0m0.192s
Of course, this alternative won't work with characters in other languages such as Ñ/ñ, Ø/ø, Ð/ð, Æ/æ and so on.
Another alternative is to modify the regex so that it matches with case insensitivity:
$ time grep '[Ff][Uu][Nn]' test.txt
all work and no plJack is no fun
Jack is no Fun
real 0m0.193s
This is reasonably fast, but of course it's a pain to convert each character into a class, and it's not easy to convert it to an alias or an sh
script, unlike the above one.
For comparison, in my system:
$ time grep fun test.txt
all work and no plJack is no fun
real 0m0.085s
$ time grep -i fun test.txt
all work and no plJack is no fun
Jack is no Fun
real 0m3.810s
Upvotes: 1
Reputation: 351
This slowness is due to grep (on a UTF-8 locale) constantly accesses files "/usr/lib/locale/locale-archive" and "/usr/lib/gconv/gconv-modules.cache".
It can be shown using the strace utility. Both files are from glibc.
Upvotes: 8
Reputation: 25599
To do a case-insensitive search, grep first has to convert your entire 50 MB file to one case or the other. That's going to take time. Not only that, but there are memory copies...
In your test case, you first generate the file. This means that it will be memory cached. The first grep run only has to mmap the cached pages; it doesn't even have to access the disk.
The case-insensitive grep does the same, but then it tries to modify that data. This means the kernel will take an exception for each modified 4 kB page, and will end up having to copy the entire 50 MB into new memory, one page at a time.
Basically, I'd expect this to be slower. Maybe not 57x slower, but definitely slower.
Upvotes: 0
Reputation: 27944
I think this bug report helps in understanding why it is slow:
bug report grep, slow on ignore-case
Upvotes: 8