Sunny

Reputation: 227

Fastest way to search for whole words in a 20 MB flat-file database (PHP)

I have a 20 MB flat-file database with about 500k lines. Only [a-z0-9-] characters are allowed, each line has about 7 words on average, and there are no empty or duplicate lines:

Flat file database:

put-returns-between-paragraphs
for-linebreak-add-2-spaces-at-end
indent-code-by-4-spaces-indent-code-by-4-spaces

I'm searching for whole words only and extracting the first 10k results from this db.

So far this code works OK if the 10k matches are found in, let's say, the first 20k lines of the db. But if the word is rare, the script must search all 500k lines, and this is 10 times slower.

Settings:

$cats = file("cats.txt", FILE_IGNORE_NEW_LINES);
$search = "end";
$limit = 10000;

Search:

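// build the pattern once; preg_quote() escapes any regex metacharacters
$pattern = '/\b' . preg_quote($search, '/') . '\b/';
$cats_found = [];
foreach ($cats as $cat) {
    if (preg_match($pattern, $cat)) {
        $cats_found[] = $cat;
        // stop as soon as $limit matches have been collected
        if (count($cats_found) >= $limit) break;
    }
}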

My PHP skills and knowledge are limited, and I don't know how to use SQL, so this is the best I can do. I would appreciate some advice.

Thanks for reading this and sorry for bad English, this is my 3rd language.

Upvotes: 0

Views: 1144

Answers (2)

Gras Double

Reputation: 16383

If most of the lines don't contain the searched word, you could execute preg_match() less often, like so:

foreach ($lines as $line) {
    // fast prefilter...
    if (strpos($line, $word) === false) {
        continue;
    }
    // ... then proper search if the line passed the prefilter
    if (preg_match("/\b{$word}\b/", $line)) {
        // found
    }
}

That said, you would need to benchmark it against your real data to see whether it actually helps.
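To make the prefilter idea concrete, here is a minimal self-contained sketch. The function name find_whole_word() and its parameters are my own invention for illustration, not something from the question. It assumes, as in the question, that lines only contain [a-z0-9-], so a hyphen always counts as a word boundary.

```php
<?php
// Hypothetical helper (name and signature are assumptions for illustration):
// returns up to $limit lines that contain $word as a whole word.
function find_whole_word(iterable $lines, string $word, int $limit): array
{
    // Build the pattern once; preg_quote() escapes regex metacharacters.
    $pattern = '/\b' . preg_quote($word, '/') . '\b/';
    $found = [];
    foreach ($lines as $line) {
        // Cheap substring check first: lines without the word are skipped here.
        if (strpos($line, $word) === false) {
            continue;
        }
        // Only now pay for the regex, to confirm the word boundaries.
        if (preg_match($pattern, $line)) {
            $found[] = $line;
            if (count($found) >= $limit) {
                break;
            }
        }
    }
    return $found;
}

// Sample lines taken from the question.
$lines = [
    'put-returns-between-paragraphs',
    'for-linebreak-add-2-spaces-at-end',
    'indent-code-by-4-spaces-indent-code-by-4-spaces',
];
print_r(find_whole_word($lines, 'end', 10000));
```

Because the parameter is typed as iterable, the same function should also accept a generator that yields lines from fgets(), if loading the whole file with file() turns out to be too heavy.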

Upvotes: 3

Vladimir Ramik

Reputation: 1930

This should work for you. It reads the file line by line, though you might still run into memory or time limits:

(you might need to tweak memory_limit and max_execution_time in php.ini, or run it via the CLI)

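$rFile = fopen( 'inputfile.txt', 'r' );
$sSearch = '123';
$iLimit  = 5000;
$aCats   = array();
// fgets() returns false at end of file, which ends the loop
while( ( $sLine = fgets( $rFile ) ) !== false )
{
    if( preg_match( "/\b$sSearch\b/", $sLine, $aMatches ) )
    {
        // $aMatches[ 0 ] holds the text matched by the whole pattern
        $aCats[] = $aMatches[ 0 ];
        // stop once $iLimit matches have been collected
        if( count( $aCats ) >= $iLimit )
        {
            break;
        }
    }
}
fclose( $rFile );
var_dump( $aCats );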

My suggestion would be to reformat the file into an SQL import and use a database. Searching a flat file is significantly slower.

Infile:

put-returns-between-paragraphs
for-linebreak-add-2-spaces-at-end
indent-code-by-4-spaces-indent-code-by-4-spaces
put-returns-between-paragraphs
for-linebreak-add-2-spaces-at-end
indent-code-by-4-spaces-indent-code-by-4-spaces
put-returns-between-paragraphs
123
for-linebreak-add-2-spaces-at-end
indent-code-by-4-spaces-indent-code-by-4-spaces
put-returns-between-paragraphs
for-linebreak-add-2-spaces-at-end
indent-code-by-4-spaces-indent-code-by-4-spaces
123
put-returns-between-paragraphs
for-linebreak-add-2-spaces-at-end
indent-code-by-4-spaces-indent-code-by-4-spaces

Output:

array(2) {
  [0]=>
  string(3) "123"
  [1]=>
  string(3) "123"
}

preg_match() stores its matches in an array ($aMatches), so we have to use [ 0 ] to get the matched text itself.
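As a rough illustration of the database suggestion above, here is a minimal sketch using SQLite through PDO. It assumes the pdo_sqlite extension is available. The table names (cat_lines, cat_words), the column names and the search_word() helper are all made up for this example. Each line is split on "-" once at import time, so a whole-word search becomes an indexed lookup instead of a scan over every line.

```php
<?php
// Sketch only: table, column and function names are hypothetical.
$db = new PDO('sqlite::memory:'); // e.g. 'sqlite:cats.db' to keep it on disk
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec('CREATE TABLE cat_lines (id INTEGER PRIMARY KEY, body TEXT NOT NULL)');
$db->exec('CREATE TABLE cat_words (word TEXT NOT NULL, line_id INTEGER NOT NULL)');

// Sample lines from the question; in practice these would be read from the file.
$lines = [
    'put-returns-between-paragraphs',
    'for-linebreak-add-2-spaces-at-end',
    'indent-code-by-4-spaces-indent-code-by-4-spaces',
];

// One-time import: store each line plus every distinct word it contains.
$db->beginTransaction();
$insLine = $db->prepare('INSERT INTO cat_lines (id, body) VALUES (?, ?)');
$insWord = $db->prepare('INSERT INTO cat_words (word, line_id) VALUES (?, ?)');
foreach ($lines as $i => $body) {
    $insLine->execute([$i + 1, $body]);
    foreach (array_unique(explode('-', $body)) as $w) {
        if ($w === '') {
            continue; // skip empty pieces from doubled or leading hyphens
        }
        $insWord->execute([$w, $i + 1]);
    }
}
$db->commit();

// The index is what lets a word lookup avoid scanning every row.
$db->exec('CREATE INDEX cat_words_idx ON cat_words (word, line_id)');

// Returns up to $limit lines containing $word as a whole word, in file order.
function search_word(PDO $db, string $word, int $limit): array
{
    $stmt = $db->prepare(
        'SELECT l.body FROM cat_words w
         JOIN cat_lines l ON l.id = w.line_id
         WHERE w.word = ? ORDER BY l.id LIMIT ?'
    );
    $stmt->bindValue(1, $word, PDO::PARAM_STR);
    $stmt->bindValue(2, $limit, PDO::PARAM_INT);
    $stmt->execute();
    return $stmt->fetchAll(PDO::FETCH_COLUMN);
}

print_r(search_word($db, 'end', 10000));
```

The import is the slow part and only has to run once. Whether the lookup ends up faster than the flat-file loop for very common words, where 10k rows have to be fetched anyway, would need measuring.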

Upvotes: 1
