Matthias
Matthias

Reputation: 4163

compare two multidimension and associative arrays and get the difference

I need to compare two arrays like this

// currently in our database
$firstArr = array(
        ["id"=>1, "another_id"=>1],
        ["id"=>2, "another_id"=>4],
        ["id"=>5, "another_id"=>9]
);

// currently fetched from csv-files
$secondArr = array(
        ["id"=>6, "another_id"=>3],
        ["id"=>2, "another_id"=>7],
        ["id"=>1, "another_id"=>1]
);

The first array represents the data that is currently in our database, meanwhile the second one represents the data that is delivered from an csv-file.

In order to not delete the whole data from the database, I need to compare the two array's. If the csv-file offers data that is not in the database, I want to input these data sets. If the database includes data that is not in the csv-file, I want to remove them from the database.

I come up with a solution that works for a small amount of data in the arrays:

$new_to_database = array();
foreach($secondArr AS $arr){
    $in_database = array_filter(array_map(function($el) use ($arr){
        if($el['id'] == $arr['id'] && $el['another_id'] == $arr['another_id']){
            return $el;
        }
    }, $firstArr));

    if(count($in_database) === 0){
        $new_to_database[] = $arr;
    }               
}

var_dump($new_to_database);   // input later on
// array(2) { [0]=> array(2) { ["id"]=> int(6) ["another_id"]=> int(3) } [1]=> array(2) { ["id"]=> int(2) ["another_id"]=> int(7) } }

The problem is that each array includes round about 5000 data sets. As a result the script takes far to long and I get the error that the execution time of 360 seconds is exceeded.

How can I solve this? I guess the algorithm should be way more efficient.

Upvotes: 2

Views: 111

Answers (2)

AbraCadaver
AbraCadaver

Reputation: 78994

You can index each array by the id and then compute the difference in the keys:

// Get items that are in CSV but not in database
$new = array_diff_key(array_column($csv, null, 'id'), array_column($db, null, 'id'));

// Get items that are in database but not in CSV
$del = array_diff_key(array_column($db, null, 'id'), array_column($csv, null, 'id'));

Here's a DEMO using two simple arrays of 50,000 items each.

That's short and sweet but you may not want to run the array_column multiple times:

$csv = array_column($csv, null, 'id');
$db  = array_column($db, null, 'id');

// Get items that are in CSV but not in database
$new = array_diff_key($csv, $db);

// Get items that are in database but not in CSV
$del = array_diff_key($db, $csv);

To compare the entire array contents (this will work if you add more than just id and another_id), you can map each array and serialize into a string representation and compare. Note that the order of the elements matters as well:

// Get items that are in CSV but not in database
$new = array_map('unserialize',
                 array_diff(array_map('serialize', $csv), array_map('serialize', $db)));

// Get items that are in database but not in CSV        
$del = array_map('unserialize',
                 array_diff(array_map('serialize', $db), array_map('serialize', $csv)));

On the DEMO site this one worked for 40,000 array items but not 50,000. This will depend on your resources and php.ini settings.


There's also an array_intersect_key and array_intersect if you want to see what is the same in each array.

Upvotes: 2

RiWe
RiWe

Reputation: 377

You are doing a lot of loops. foreach, array_filter and array_map use a lot of time.

$new_to_database = array_udiff($secondArr, $firstArr, function ($s, $f) {
    if ($s['id'] == $f['id'] && $s['another_id'] == $f['another_id']) {
        return 0;
    } else {
        return -1;
    }
});

This gives the results you need with less iterations.

Upvotes: 1

Related Questions