Anmol G

Reputation: 191

Speed up an algorithm using PHP for large textual data and files

There are two tables as below:-

  1. document table - this table contains the path of the file that holds the actual HTML content, and also has a column for hierarchy

  2. find and replace table - this table contains the word to find and the string to replace it with (the replacement can be a link or HTML itself); the remaining field is a comma-separated list of document IDs (from table 1) telling which word is to be replaced in which documents

In short, this process lets the user find and replace keywords, driven by the second table, in only the documents that require it.

The algorithm works as below:-

  1. Get the count of all records in the documents table
  2. Break them into sets of 100 records (to avoid server timeouts)
  3. Loop over each set of 100; for each record, use the document ID and hierarchy number to get the list of keywords and the replacement content for that particular document (note: the WHERE condition runs against the comma-separated string)
  4. Fetch the file from the server using the path in the first table and extract the HTML content
  5. Loop over each keyword in sequence and replace it in the content with the required replacement from the second table
  6. Create the final file and save it on the server

The process works fine and gives desired results too.

The problem begins when the data grows. As of now, there are around 50,000 entries in the first table, and thus the same number of files on the server.

The second table contains around 15,000 find-and-replace records, each carrying a long comma-separated string of document IDs.

With this amount of data, the process runs for days, and that should not happen.

The database is MySQL 5.5 and the backend is PHP (Laravel 5.4). The OS is CentOS 7 with an Nginx web server.

Is there a way to make this process smooth and less time-consuming? Any help is appreciated.

Upvotes: 0

Views: 191

Answers (1)

O. Jones

Reputation: 108816

PHP has a function shell_exec($shellCommand);

You may wish to use the GNU/Linux shell-accessible program called sed (the stream editor) to do this substitution, rather than slurping each file into PHP and then writing it out again.

For example,

 $result = shell_exec("cd what/ever/directory; sed 's/this/that/g' inputfile > outputfile");

will read what/ever/directory/inputfile, change all the this strings to that, and write the result into what/ever/directory/outputfile. And it will do so very quickly compared to PHP.
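One practical caveat (an addition of mine, not part of the original suggestion): the replacement strings in this question can be links or HTML, which usually contain /, sed's default s-command delimiter. sed accepts any character as the delimiter, so picking one absent from the data, such as |, avoids escaping every slash. The text and URL below are invented for illustration:

```shell
# The replacement contains slashes, so use | as the s-command delimiter
# instead of the default /.
echo 'see the docs' | sed 's|docs|<a href="/help/docs">docs</a>|g'
# -> see the <a href="/help/docs">docs</a>
```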

Edit: Why does this approach save a lot of time?

  • Shell programs like sed have been around for decades and are highly optimized. sed uses far less processing power (far fewer CPU cycles) than PHP to do what it does, so the transformation of the files is faster.
  • The task of editing a file requires reading, transforming, and writing it. Done the way you describe, each of those phases must finish before the next can start. sed, on the other hand, is a stream editor: it reads, transforms, and writes in parallel, line by line.
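The streaming behaviour is easy to see: sed can transform data arriving on a pipe, line by line, without ever holding the whole input (the words here are just placeholders):

```shell
# sed transforms each line as it arrives on the pipe; the input is
# never loaded into memory as a whole.
printf 'red fish\nblue fish\n' | sed 's/red/rojo/; s/blue/azul/'
# -> rojo fish
#    azul fish
```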

To get the most out of this approach, you'll need your PHP program to write more complex editing commands than 's/this/that/g'. You'll want to perform multiple substitutions in a single sed run. You can do that by concatenating editing instructions, as in this example:

 's/this/that/g; s/blue/azul/g; s/red/rojo/g'

A single shell command can be around 100K characters in length, so you probably won't hit limits on the length of those editing instructions.
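As a sketch of how such a combined instruction list might be applied (the keyword pairs and file names below are invented for illustration; in practice the script string would be assembled from the second table's rows), one sed invocation runs the whole script over a document in a single pass:

```shell
# Hypothetical editing script, as it might be built from table 2's
# find/replace pairs for one document.
script='s|blue|azul|g; s|red|rojo|g; s|green|verde|g'

# in.html stands in for a file fetched via the path in table 1.
printf 'red sky\nblue sea\n' > in.html

# One sed run applies every substitution and writes the final file.
sed "$script" in.html > out.html
cat out.html
# -> rojo sky
#    azul sea
```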

By suggesting the use of sed I do suggest using a different algorithm.

Upvotes: 0
