Reputation: 59
I am new to PHP, so I have no idea how to accomplish this. I have a folder where txt files of varying size and content are constantly being created.
I am trying to build a sort of "search engine" in PHP on a Linux system. So far I am using the code below.
if ($_SERVER['REQUEST_METHOD'] == 'POST') {
    $path = '/example/files';
    $findThisString = $_POST['text_box'];
    $dir = dir($path);
    while (false !== ($file = $dir->read())) {
        if ($file != '.' && $file != '..') {
            if (is_file($path . '/' . $file)) {
                $data = file_get_contents($path . '/' . $file);
                if (stripos($data, $findThisString) !== false) {
                    echo '<p></p><font style="color:white; font-family:Arial">Found Match - <a href="http://test.example.com/files/'. $file .'">'. $file .'</a> <br>';
                }
            }
        }
    }
    // Close the handle inside the block, since $dir only exists for POST requests.
    $dir->close();
}
This code works great, with one problem: once the folder reaches around 40,000 files, the search takes a long time to return any results. I can't use external commands such as grep. It has to be written in pure PHP, like the code above.
Is there any way to optimize the code above to run faster? Or is there a better search function I can use in PHP?
Upvotes: 0
Views: 186
Reputation: 2061
There are many possible reasons why the script is slow, and what you need to do to speed it up depends entirely on which parts of the code cause the slowdown.
That means you need to run the code through a profiler, and then tweak the parts it reports as the cause. Without a profiler, all we can do is guess, and not necessarily correctly.
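As a rough stand-in for a real profiler, the phases of the search can be timed by hand. The sketch below is only an illustration under assumptions: the function name profile_search() and the keys of the array it returns are made up for this example and are not part of the original code. It uses nothing beyond glob(), is_file(), file_get_contents(), stripos() and microtime().

```php
<?php
/**
 * Hypothetical helper (not from the original code): measures how long the
 * listing, reading and searching phases take for one search term.
 *
 * @return array Seconds spent per phase, plus the number of files read.
 */
function profile_search(string $term, string $path): array
{
    $timings = ['list' => 0.0, 'read' => 0.0, 'search' => 0.0, 'files' => 0];

    // Phase 1: list the directory. glob() returns false on error.
    $start = microtime(true);
    $files = glob(rtrim($path, '/') . '/*') ?: [];
    $timings['list'] = microtime(true) - $start;

    foreach ($files as $file) {
        if (!is_file($file)) {
            continue;
        }

        // Phase 2: read the whole file into memory.
        $start = microtime(true);
        $data = file_get_contents($file);
        $timings['read'] += microtime(true) - $start;
        if ($data === false) {
            continue;
        }
        $timings['files']++;

        // Phase 3: case-insensitive substring search.
        $start = microtime(true);
        stripos($data, $term);
        $timings['search'] += microtime(true) - $start;
    }

    return $timings;
}

// Usage (path taken from the question):
// print_r(profile_search('example', '/example/files'));
```

Whichever phase dominates the totals is the one worth optimizing first. A dedicated profiler will give a more detailed picture than this.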
As noted in the comments on your question, using an existing search engine would be the far better solution, especially one purpose-built for this kind of task, as it will cut the search time drastically.
Even the standard grep command on Linux would be an improvement.
That said, I suspect your code is slow because you're reading and searching the contents of every file in PHP. stripos() is a likely suspect here, as a case-insensitive search is slower than a case-sensitive one.
Another factor might be the read() calls in the loop, as I believe they do an I/O operation on each call. Also, a large number of echo calls in a script can cause a slowdown: a couple of hundred is not really noticeable, but a few thousand will be.
Taking these last points into consideration, along with some general changes I recommend to make your code easier to maintain, I've made the following changes to your code.
<?php
if (isset ($_POST['text_box'])) {
    $path = '/example/files';
    $result = search_files ($_POST['text_box'], $path);
    // Output the collected result in one go.
    echo $result;
}

/**
 * Searches through the files in the given path, for the search term.
 *
 * @param string $term The term to search for, only "word characters" as defined by RegExp allowed.
 * @param string $path The path which contains the files to be searched.
 *
 * @return string Either a list of links to the files, or an error message.
 */
function search_files ($term, $path) {
    // Ensuring that we have a closing slash at the end of the path, so that
    // we can append a wildcard pattern for glob() to use.
    if (substr ($path, -1) != '/') {
        $path .= '/';
    }

    // If we don't have a valid/readable path we need to throw an error now.
    // This only happens if the code itself is wrong, as it's not user-supplied,
    // thus an exception is thrown.
    if (!is_dir ($path) || !is_readable ($path)) {
        throw new InvalidArgumentException ("Not a valid search path!");
    }

    // This should be validated to ensure you get sane input,
    // in order to avoid erroneous responses to the user and
    // possible attacks.
    // Added a simple test to ensure we only accept "word characters".
    if (!preg_match ('/^\w+\\z/', $term)) {
        // Invalid input. Show warning to user.
        return 'Not a valid search string.';
    }

    // Using glob so that we retrieve a list of all files in one operation.
    // Note that glob() returns the full path of each entry, and false on error.
    $contents = glob ($path.'*') ?: array ();

    // Using a holding variable, as this many echo statements take
    // noticeably longer than just concatenating strings and
    // echoing the result once.
    $output = '';

    // Using printf() templates to make the code easier to read.
    // Ideally the HTML-code shouldn't be in this string either, but adding
    // a templating system is far beyond the reach of this Q&A.
    $outTemplate = '<p class="found">Found Match - <a href="http://test.example.com/files/%1$s">%2$s</a></p>';

    foreach ($contents as $file) {
        // Skip anything that isn't a regular file. glob() with "*" never returns
        // the "." and ".." entries, so they need no special handling.
        if (!is_file ($file)) {
            continue;
        }

        // This one is the big issue. Reading all of the files one by one will take time!
        $data = file_get_contents ($file);
        if ($data === false) {
            continue;
        }

        // Same with running a case-insensitive search!
        if (stripos ($data, $term) !== false) {
            // $file is a full path, so only the file name goes into the link.
            $name = basename ($file);

            // Added output escaping to prevent issues with possible meta-characters.
            // (A problem also known as XSS attacks)
            $output .= sprintf ($outTemplate, htmlspecialchars (rawurlencode ($name)), htmlspecialchars ($name));
        }
    }

    // Lastly, if the output string is empty we haven't found anything.
    if (empty ($output)) {
        return "Term not found";
    }

    return $output;
}
Upvotes: 1
Reputation: 139
If you can't use Linux commands, then you have two options: 1) Store the files in a database, and query the database whenever you need to search. 2) Build a single index file that records the list of files.
Both options reduce the time the script takes to execute. To keep things up to date, you can write a cron task that imports new files into the database or the index file.
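The index-file option could look roughly like the sketch below. It is only an illustration under several assumptions: the function names build_index() and search_index() and the JSON index format are made up for this example, the index maps each lowercased word to the names of the files containing it, and words are split on non-word characters (so it is ASCII-oriented). Unlike stripos(), this finds whole words only, not arbitrary substrings. The index file should live outside the searched folder so that it is not indexed itself.

```php
<?php
/**
 * Hypothetical indexer, meant to be run from a cron task.
 * Writes a JSON map of word => list of file names to $indexFile.
 *
 * @return int Number of distinct words in the index.
 */
function build_index(string $path, string $indexFile): int
{
    $index = [];
    foreach (glob(rtrim($path, '/') . '/*') ?: [] as $file) {
        if (!is_file($file)) {
            continue;
        }
        $data = file_get_contents($file);
        if ($data === false) {
            continue;
        }
        // Split into lowercase words; each file is listed once per word.
        $words = preg_split('/\W+/', strtolower($data), -1, PREG_SPLIT_NO_EMPTY);
        foreach (array_unique($words) as $word) {
            $index[$word][] = basename($file);
        }
    }
    file_put_contents($indexFile, json_encode($index), LOCK_EX);

    return count($index);
}

/**
 * Hypothetical lookup, run per search request: reads one file instead of
 * every file in the folder.
 *
 * @return array Names of the files containing $term as a whole word.
 */
function search_index(string $term, string $indexFile): array
{
    if (!is_file($indexFile)) {
        return [];
    }
    $index = json_decode((string) file_get_contents($indexFile), true);
    if (!is_array($index)) {
        return [];
    }

    return $index[strtolower($term)] ?? [];
}

// Usage (the index location is an assumption):
// build_index('/example/files', '/example/index.json');   // from cron
// $matches = search_index($_POST['text_box'], '/example/index.json');
```

Rebuilding the whole index on every cron run is the simplest approach. With many files, indexing only those changed since the last run would be the next step.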
Upvotes: 0