Reputation: 637
I have some very large data files and for business reasons I have to do extensive string manipulation (replacing characters and strings). This is unavoidable. The number of replacements runs into hundreds of thousands.
It's taking longer than I would like. PHP is generally very quick but I'm doing so many of these string manipulations that it's slowing down and script execution is running into minutes. This is a pain because the script is run frequently.
I've done some testing and found that str_replace is fastest, followed by strtr, followed by preg_replace. I've also tried individual str_replace statements as well as constructing arrays of patterns and replacements.
I'm toying with the idea of isolating the string manipulation operations and writing them in a different language, but I don't want to invest time in that option only to find that the improvement is negligible. Plus, I only know Perl, PHP and COBOL, so for any other language I would have to learn it first.
I'm wondering how other people have approached similar problems?
I have searched and I don't believe that this duplicates any existing questions.
Upvotes: 11
Views: 2272
Reputation: 4319
Does this manipulation have to happen on the fly? If not, might I suggest pre-processing, perhaps via a cron job.
Define the rules you're going to be using. Is it just one str_replace or a few different ones? Do you have to do the entire file in one shot, or can you split it into multiple batches (e.g. half the file at a time)?
Once your rules are defined, decide when you will do the processing (e.g. 6am before everyone gets to work).
Then you can set up a job queue. I have used cron jobs to run my PHP scripts on a given time schedule.
For a project I worked on a while ago I had a setup like this:
7:00 - pull 10,000 records from MySQL and write them to 3 separate files.
7:15 - run a complex regex on file one.
7:20 - run a complex regex on file two.
7:25 - run a complex regex on file three.
7:30 - combine all three files into one.
8:00 - walk into the meeting with the formatted file your boss wants. *profit*
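In crontab terms (the script names and paths are hypothetical, just to illustrate the shape of that schedule), the setup might look something like this:

# m  h  dom mon dow  command
0  7  *  *  *  php /path/to/extract_records.php   # pull the records from MySQL into three files
15 7  *  *  *  php /path/to/process_file.php 1    # complex regex pass on file one
20 7  *  *  *  php /path/to/process_file.php 2    # file two
25 7  *  *  *  php /path/to/process_file.php 3    # file three
30 7  *  *  *  php /path/to/combine_files.php     # merge the three results into one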
Hope this helps get you thinking...
Upvotes: 0
Reputation: 285
Since you know Perl, I would suggest doing the string manipulations in Perl using regular expressions, and using the final result in the PHP web page.
This seems better because you can use PHP only where necessary.
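A minimal sketch of that split (the file names and patterns here are placeholders): Perl does the heavy replacement work in a streaming one-liner, and PHP just picks up the result.

// Hand the replacements to a Perl one-liner, which edits the file line by
// line in a single pass, then read the result back in PHP
$in  = escapeshellarg('data.txt');            // hypothetical input file
$out = escapeshellarg('data.processed.txt');  // hypothetical output file
shell_exec("perl -pe 's/bad/good/g; s/worse/better/g' $in > $out");
$result = file_get_contents('data.processed.txt');

Perl's -pe flag reads, edits and prints each line as it goes, so memory use stays flat no matter how large the file is.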
Upvotes: 0
Reputation: 11942
I think the question is why are you running this script frequently? Are you performing the computations (the string replacements) on the same data over and over again, or are you doing it on different data every time?
If you're doing it on different data every time, there isn't much more you can do to improve performance on the PHP side. You can improve performance in other ways, such as using better hardware (SSDs for faster reads/writes on the files), multicore CPUs, breaking the data into smaller pieces and running multiple scripts at the same time to process it concurrently, and faster RAM (i.e. higher bus speeds).
If you're performing the same computation on the same data over and over, consider caching the result using something like Memcached or Redis (key/value cache stores), so that you only perform the computation once; after that it's just a linear read from memory, which is very cheap and involves virtually no CPU overhead (you might also utilize the CPU cache at this level).
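For example, here is a minimal sketch of that caching idea using the Memcached extension (the file name and key scheme are just illustrative):

$file  = 'in.txt';  // hypothetical data file
$cache = new Memcached();
$cache->addServer('127.0.0.1', 11211);
// Key the cache on the file's contents so a changed file forces a recompute
$key    = 'replaced:' . md5_file($file);
$result = $cache->get($key);
if ($result === false) {
    // Cache miss: perform the expensive replacement once, then store it
    $result = str_replace(array('bad', 'worse'), array('good', 'better'), file_get_contents($file));
    $cache->set($key, $result);
}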
String manipulation in PHP is already cheap because PHP strings are essentially just byte arrays. There's virtually no overhead from PHP in reading a file into memory and storing it in a string. If you have some sample code that demonstrates where you're seeing performance issues, and some benchmark numbers, I might have some better advice, but right now it just looks like you need to refactor your approach based on what your underlying needs are.
For example, there are both CPU and I/O costs to consider individually when you're dealing with data in different situations. I/O involves blocking since it's a system call. This means your CPU has to wait for more data to come over the wire (while your disk transfers data to memory) before it can continue to process or compute that data. Your CPU is always going to be much faster than memory and memory is always much faster than disk.
Here's a simple benchmark to show you the difference:
/* First, let's create a simple test file to benchmark */
file_put_contents('in.txt', str_repeat(implode(" ", range('a','z')), 10000));
/* Now let's write two different tests that replace all vowels with asterisks */
// The first test reads the entire file into memory and performs the computation all at once
function test1($filename, $newfile) {
    $start = microtime(true);
    $data = file_get_contents($filename);
    // The replacement must be a plain string here: passing array('*') would
    // replace 'a' with '*' but delete the other four vowels entirely
    $changes = str_replace(array('a','e','i','o','u'), '*', $data);
    file_put_contents($newfile, $changes);
    return sprintf("%.6f", microtime(true) - $start);
}
// The second test reads only 8KB chunks at a time and performs the computation on each chunk
function test2($filename, $newfile) {
    $start = microtime(true);
    $fp = fopen($filename, "r");
    $changes = '';
    while (!feof($fp)) {
        // Same fix as above: the replacement is a single string, not array('*')
        $changes .= str_replace(array('a','e','i','o','u'), '*', fread($fp, 8192));
    }
    fclose($fp);
    file_put_contents($newfile, $changes);
    return sprintf("%.6f", microtime(true) - $start);
}
The above two tests do the same exact thing, but Test2 proves significantly faster for me when I'm using smaller amounts of data (roughly 500KB in this test).
Here's the benchmark you can run...
// Conduct 100 iterations of each test and average the results
$test1 = $test2 = array();
for ($i = 0; $i < 100; $i++) {
    $test1[] = test1('in.txt', 'out.txt');
    $test2[] = test2('in.txt', 'out.txt');
}
echo "Test1 average: ", sprintf("%.6f",array_sum($test1) / count($test1)), "\n",
"Test2 average: ", sprintf("%.6f\n",array_sum($test2) / count($test2));
For me, the above benchmark gives Test1 average: 0.440795 and Test2 average: 0.052054, which is an order of magnitude difference, and that's just testing on 500KB of data. Now, if I increase the size of the file to about 50MB, Test1 actually proves to be faster, since there are fewer system I/O calls per iteration (i.e. we're just reading from memory linearly in Test1), but more CPU cost (i.e. we're performing a much larger computation per iteration). The CPU generally proves able to handle much larger amounts of data at a time than your I/O devices can send over the bus.
So it's not a one-size-fits-all solution in most cases.
Upvotes: 0
Reputation: 36341
It is possible that you have hit a wall with PHP. PHP is great, but in some areas it falls short, such as processing lots of data. There are a few things you could do:
Upvotes: 0
Reputation: 48387
The limiting factor is PHP rebuilding the strings. Consider:
$out=str_replace('bad', 'good', 'this is a bad example');
It's a relatively low-cost operation to locate 'bad' in the string, but in order to make room for the substitution, PHP then has to move each of the characters e, l, p, m, a, x, e, space (working back from the end of the string) up one place before writing in the new value.
Passing arrays for the needle and replacement will improve performance, but not as much as you might hope.
AFAIK, PHP does not have low-level memory access functions, hence an optimal solution would have to be written in a different language, dividing the data up into 'pages' which can be stretched to accommodate changes. You could approximate this using str_split to divide the string up into smaller units (hence each replacement would require less memory juggling).
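A rough sketch of that idea, assuming $data holds the file contents (the chunk size is arbitrary, and note that a match spanning a chunk boundary would be missed, so in practice the chunks would need to overlap by the length of the longest search string):

// Split the data into 64KB pages so each replacement only shuffles a small buffer
$pages = str_split($data, 65536);
foreach ($pages as $i => $page) {
    $pages[$i] = str_replace('bad', 'good', $page);
}
$data = implode('', $pages);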
Another approach would be to dump the data into a file and use sed (this still operates one search/replace at a time), e.g.
sed -i 's/bad/good/g;s/worse/better/g' file_containing_data
Upvotes: 1
Reputation: 3806
If you can allow the replacements to be handled over multiple executions, you could create a script that processes each file and writes the changes to a temporary duplicate. That lets you copy data from one file to another, process the copy, and then merge the changes; if you use a stream buffer you might be able to remember each row, so the copy/merge step can be skipped.
The problem, though, is that a run might stop before a file is complete, leaving it in a mixed state. That is why a temporary file is suitable.
This would allow the script to run as many times as needed while there are still changes to be made; all you need is a state file that remembers which files have been processed.
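A sketch of that approach, with hypothetical file names and a simple log standing in for the "remember which files were processed" step:

$logfile = 'processed.log';  // hypothetical state file
$done = file_exists($logfile) ? file($logfile, FILE_IGNORE_NEW_LINES) : array();
foreach (glob('data/*.txt') as $file) {
    if (in_array($file, $done)) {
        continue;  // already handled by an earlier run
    }
    // Write the replaced content to a temporary copy first...
    $tmp = $file . '.tmp';
    file_put_contents($tmp, str_replace('bad', 'good', file_get_contents($file)));
    // ...then swap it in, so an interrupted run never leaves a half-changed file
    rename($tmp, $file);
    file_put_contents($logfile, $file . "\n", FILE_APPEND);
}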
Upvotes: 1
Reputation: 586
If you have to do this operation only once, and you are replacing with static content, you can use Dreamweaver or another editor, so you will not need PHP. It will be much faster.
Still, if you do need to do this dynamically with PHP (you need database records or the like), you can use shell commands via exec (Google for search-replace).
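As a sketch of the exec route (the file name and patterns are placeholders; note that escapeshellarg handles shell quoting but not sed's own metacharacters, so the patterns must still be sed-safe):

$search  = 'oldvalue';   // e.g. pulled from a database record
$replace = 'newvalue';
// Let sed edit the file in place; $status is non-zero if the command failed
exec(sprintf('sed -i %s %s',
    escapeshellarg("s/$search/$replace/g"),
    escapeshellarg('data.txt')), $output, $status);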
Upvotes: 0
Reputation: 2665
There are two ways of handling this, IMO:
Upvotes: 1
Reputation: 83
Well, considering that in PHP some string operations are faster than array operations, and you are still not satisfied with its speed, you could write an external program as you mentioned, probably in some "lower-level" language. I would recommend C or C++.
Upvotes: 1