Alexander

Reputation: 73

Fastest approach to search within file contents of a directory

I have a directory containing per-user files for a program of mine; it holds around 70k JSON files.

The current search method uses glob and foreach. It's getting quite slow and hogging the server. Is there any good way to search through these files more efficiently? I'm running this on an Ubuntu 16.04 machine and I can use exec if needed.

Update:

These are JSON files, and each file needs to be opened to check whether it contains the search query. Looping over the files is quite fast, but opening each one takes quite a while.

These cannot be indexed using SQL or memcached, as I'm using memcached for some other things.
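Since exec is on the table, here's a minimal shell sketch of the search itself. The temp directory and sample files below are stand-ins for the real 70k-file directory and query:

```shell
# Stand-in for the real directory: two sample JSON files in a temp dir.
dir=$(mktemp -d)
echo '{"user":"alice","plan":"pro"}' > "$dir/1.json"
echo '{"user":"bob","plan":"free"}' > "$dir/2.json"

# -r recurses, -l prints only matching filenames,
# --include restricts the scan to *.json files.
grep -rl --include='*.json' '"plan":"pro"' "$dir"
```

A PHP script can run the same command through exec() and read the matching filenames from its output, one per line.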

Upvotes: 1

Views: 813

Answers (2)

sepehr

Reputation: 18455

As you implied yourself, to make this the most performant search possible, you need to hand over the task to a tool that is designed for this purpose.

I'd say go beyond grep and see what's even better than ack. Also look at ag, and then settle on ripgrep, as it's the best of its kind in town.


Experiment

I did a little experiment with ack on a low-spec laptop, searching for an existing class name across 19,501 files. Here are the results:

$ cd ~/Dev/php/packages
$ ack -f | wc -l 
19501

$ time ack PHPUnitSeleniumTestCase | wc -l
10
ack PHPUnitSeleniumTestCase  7.68s user 2.99s system 21% cpu 48.832 total
wc -l  0.00s user 0.00s system 0% cpu 48.822 total

I ran the same experiment, this time with ag, and it really surprised me:

$ time ag PHPUnitSeleniumTestCase | wc -l
10
ag PHPUnitSeleniumTestCase  0.24s user 0.98s system 13% cpu 9.379 total
wc -l  0.00s user 0.00s system 0% cpu 9.378 total

Excited by those results, I went on and tried ripgrep as well. Even better:

$ time rg PHPUnitSeleniumTestCase | wc -l
10
rg PHPUnitSeleniumTestCase  0.44s user 0.27s system 19% cpu 3.559 total
wc -l  0.00s user 0.00s system 0% cpu 3.558 total

Experiment with this family of tools and see what best suits your needs.
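For this question's use case specifically, something along these lines could work with ripgrep (the temp directory and query are stand-ins; flags as documented in ripgrep's help):

```shell
# Stand-in data: one matching and one non-matching JSON file.
dir=$(mktemp -d)
echo '{"id":42,"tag":"needle"}' > "$dir/a.json"
echo '{"id":43,"tag":"other"}'  > "$dir/b.json"

# -l: print matching filenames only; -F: treat the query as a
# fixed string, not a regex; --glob: restrict the scan to *.json.
rg -lF --glob '*.json' 'needle' "$dir"
```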


P.S. ripgrep's original author has left a comment under this post, saying that ripgrep is faster than {grep, ag, git grep, ucg, pt, sift}. Interesting read, fabulous work.

Upvotes: 2

matt

Reputation: 4734

The answer differs depending on whether you store the files on an SSD or an HDD.

HDD

With an HDD, the most probable bottleneck isn't PHP but the low number of I/O operations HDDs can handle. I would strongly advise moving to an SSD, or using a RAM disk if that's feasible.

Let's assume you're not able to move the directory to an SSD. That means you're stuck on an HDD, which can perform roughly 70–200 IOPS (I/O operations per second, assuming your system isn't caching the files in RAM). Your best bet is to minimize metadata calls like fstat, filemtime(), file_exists(), etc., and focus on the operations that actually read file contents (file_get_contents() and the like).

The operating system and the HDD controller can group I/O operations to work around the low IOPS available. For example, if two files are close to each other on the disk, you can read both (or more) of them at the cost of reading just one (I'm simplifying here, but let's not get too technical). So, contrary to some beliefs, reading multiple files at once (for example, using a threaded program, xargs, etc.) might greatly improve performance.

Unfortunately, this only helps if the files are physically close to each other on the disk. If you really want to speed things up, first decide in what order your application will read the files, as that's crucial for the next step. Once you've figured that out, you can erase the drive completely (assuming you're able to) and write the files back sequentially in the order you settled on. This should place the files side by side and improve the effective IOPS when reading them in parallel.

Next, go to the shell and use a program that can process files in parallel. PHP has support for pthreads, but don't go down that route. xargs with multiple processes (the -P option) might be helpful if you plan to keep your application single-threaded. Read the shell_exec() output and process it in your PHP program.
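As a sketch of that last point (the temp directory and pattern below are stand-ins for the real ones), a find/xargs pipeline could look like this:

```shell
# Stand-in data: four small JSON files in a temp directory.
dir=$(mktemp -d)
for i in 1 2 3 4; do
  echo "{\"n\":$i}" > "$dir/$i.json"
done

# -print0/-0 keep unusual filenames safe; -P 4 runs four grep
# processes in parallel; -n 500 batches files per grep invocation.
find "$dir" -name '*.json' -print0 |
  xargs -0 -P 4 -n 500 grep -l '"n":2'
```

From PHP, you'd run this pipeline via shell_exec() and split its output on newlines to get the matching filenames.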

SSD

As with an HDD, parallel processing might help. It would be best, however, to see your code first, as I/O might not be the problem here.

Upvotes: 1
