Samul
Samul

Reputation: 2019

Best method to intersect many arrays

I have a 2 dimensional array ($a), the first dimensional has 600 elements, the second dimension can have from 1k to 10k elements. Something like:

$a[0] = array(4 => true, 10 => true, 18 => true...);
$a[1] = array(6 => true, 10 => true, 73 => true...);
$a[599] = array(106 => true, 293 => true, 297 => true...);

I need to save in a file the $intersection of each of those 600 elements with each other 5x. Something like this:

for ($i=0;$i<=599;$i++) {

    for ($j=$i+1;$j<=599;$j++) {

        for ($k=$j+1;$k<=599;$k++) {

            for ($p=$k+1;$p<=599;$p++) {

                for ($m=$p+1;$m<=599;$m++) {

                     $intersection = array_intersect_key($a[$i], $a[$j], $a[$k], $a[$p], $a[$m]);

                }  

            }

        }
    
    }

}

I am using array_intersect_key because its much faster than array_intersect because it's O(n) vs O(n^2).

That code works fine but runs pretty slow. So I made maaany other attempts which runs a little faster, something like:

for ($i=0;$i<=599;$i++) {

    for ($j=$i+1;$j<=599;$j++) {

        $b = array_intersect_keys($a[$i], $a[$j]);

        for ($k=$j+1;$k<=599;$k++) {
             
            $c = array_intersect_keys($b, $a[$k]);

            for ($p=$k+1;$p<=599;$p++) {

                $d = array_intersect_keys($c, $a[$p]);

                for ($m=$p+1;$m<=599;$m++) {

                     $intersection = array_intersect_key($d, $a[$m]);

                }  

            }

        }
    
    }

}

This runs 3x faster than the 1st code, but it's still pretty slow. I also tried creating long binary vectors like 00000011001110... (for each 600 elements) and doing bitwise operations && which works exactly like an intersect, but it's not much faster and uses a lot more memory.

So I am wondering, do you have any breakthrough suggestion? Any mathematical operation, matrix operation... that I can use? I am really tempted to recreate the C+ code of the array_intersect_keys because at every execution it does a lot of things that I dont need, like ordering the arrays before performing the intersection - I dont need that because I already ordered all my arrays before operations start, so that creates a big overhead. But I still dont want to go that route because it's not that simple to create a PHP extension that performs better than the native implementation.

Note: my array $a has all the elements already sorted and the elements in the first dimension that have less elements in the second dimension are positioned first in the array so array_intersect_keys works much faster because that function has to check only until reach the end of the first parameter (which is smaller in size).

EDIT

@user1597430 Again I appreciate all your attention and further explanations you gave me. You made me complete understand everything you said! And after that, I decided to take a shot on my own implementation of your algorithm and I think it is a little bit simpler and faster because I never need to use sort and I also make good use of array keys (instead of values). Feel free to use the algorithm below whenever you want if you ever need it, you already provided me with lots of your time!

After implementing the algorithm below I just realized one thing at the end! I can ONLY start checking the intersections/combinations after all the combinatory phase is done, and after that, I need to run another algorithm to parse the results. To me that's not possible because using 600x1000 takes already a long time and I may very well end up having to use 600x20000 (yeap, 600 x 20k) and I am pretty sure it's gonna take forever. And only after "forever" is that I will be able to check the combinations/intersections result (parse). Do you know a way in such a manner that I can already check the intersections after each computation or not having to wait ALL the combinations be generated? That's pretty sad that after taking almost 6 hours to understand and implement the algorithm below, I just came to the conclusion it wont fit my need. Dont worry, I will accept your answer if no better one come, since you already helped me a lot and I LEARNED A LOT WITH you!

<?php

$quantity_first_order_array = 10;
$quantity_second_order_array = 20;







$a = array();
$b = array();

for ($i=0;$i<$quantity_first_order_array;$i++) {
    
    for ($j=0;$j<$quantity_second_order_array;$j++) {
  
        //I just use `$i*2 + $j*3 + $j*$i` as a way to avoid using `rand`, dont try to make sense of this expression, it's just to avoid using `rand`.
        $a['group-' . $i][$i*2 + $j*3 + $j*$i] = true;

        $b[$i*2 + $j*3 + $j*$i] = true;

    }       
    
}



//echo "aaaa\n\n";var_dump($a);exit();



$c = array();

foreach ($b as $chave1 => $valor1) {
    
    $c[$chave1] = array();
    
    foreach ($a as $chave2 => $valor2) {
    
        if (isset($a[$chave2][$chave1])) {
            
            $c[$chave1][] = $chave2;
            
        }
        
    }
    
}

foreach ($c as $chave1 => $valor1) {
    
    if (count($c[$chave1]) < 5) {
    
        unset($c[$chave1]);
        
    }
    
}


//echo "cccc\n\n";var_dump($c);exit();




$d = array();

foreach ($c as $chave1 => $valor1) {
    
    $qntd_c_chave1 = count($c[$chave1]);
    
    for ($a1=0;$a1<$qntd_c_chave1;$a1++) {
        
        for ($a2=$a1 + 1;$a2<$qntd_c_chave1;$a2++) {
            
            for ($a3=$a2 + 1;$a3<$qntd_c_chave1;$a3++) {
                
                for ($a4=$a3 + 1;$a4<$qntd_c_chave1;$a4++) {
                    
                    for ($a5=$a4 + 1;$a5<$qntd_c_chave1;$a5++) {
                        
                        $d[$c[$chave1][$a1] . '|' . $c[$chave1][$a2] . '|' . $c[$chave1][$a3] . '|' . $c[$chave1][$a4] . '|' . $c[$chave1][$a5]][] = $chave1;
                        
                    }
                    
                }
                
            }
            
        }
        
    }
    
}

echo "dddd\n\n";var_dump($d);exit();


?>

Upvotes: 2

Views: 96

Answers (1)

user1597430
user1597430

Reputation: 1146

Well, I usually don't do this for free, but since you mentioned genetic algos... Hello from Foldit community.

<?php

# a few magic constants

define('SEQUENCE_LENGTH',           600);
define('SEQUENCE_LENGTH_SUBARRAY', 1000);

define('RAND_MIN',      0);
define('RAND_MAX', 100000);

define('MIN_SHARED_VALUES', 5);

define('FILE_OUTPUT', sys_get_temp_dir() . DIRECTORY_SEPARATOR . 'out.csv');


# generates a sample data for processing

$a = [];

for ($i = 0; $i < SEQUENCE_LENGTH; $i++)
{
    for ($j = 0; $j < SEQUENCE_LENGTH_SUBARRAY; $j++)
    {
        $a[$i][] = mt_rand(RAND_MIN, RAND_MAX);
    }

    $a[$i] = array_unique($a[$i]);

    sort($a[$i]);
}


# prepares a map where key indicates the number
# and value is a sequence that contains this number

$map = [];

foreach ($a as $i => $array)
{
    foreach ($array as $value)
    {
        $map[$value][$i] = $i;
    }
}

ksort($map);


# extra optimization: drop all values that don't have at least
# MIN_SHARED_VALUES (default = 5) sequences
foreach ($map as $index => $values)
{
    if (count($values) < MIN_SHARED_VALUES)
    {
        unset($map[$index]);
    }
    else
    {
        # reindex array keys - we need that later for simple
        # "for" loops that should start from "0"
        $map[$index] = array_values($map[$index]);
    }
}


$file = fopen(FILE_OUTPUT, 'w');

# permutation: generates all sequences that share the same $value
foreach ($map as $value => $array)
{
    $array_length = count($array);

    for ($i = 0; $i < $array_length; $i++)
    {
        for ($j = $i + 1; $j < $array_length; $j++)
        {
            for ($k = $j + 1; $k < $array_length; $k++)
            {
                for ($l = $k + 1; $l < $array_length; $l++)
                {
                    for ($m = $l + 1; $m < $array_length; $m++)
                    {
                        # index indicates a group that share a $value together
                        $index = implode('-', [$array[$i], $array[$j], $array[$k], $array[$l], $array[$m]]);

                        # instead of CSV export you can use something like
                        # "$result[$index][] = $value" here, but you need
                        # tons of RAM to do that
                        fputcsv($file, [$index, $value]);
                    }
                }
            }
        }
    }
}

fclose($file);

Locally script makes ~6.3 million permutations, that takes ~15s in total, 70% of the time goes to the I/O HDD operations.

You are free to play with the output method (by using linux sort out.csv -o sorted-out.csv on the top of your CSV file or by creating a separate file pointer for each unique $index sequence to dump your results - it's up to you).

Upvotes: 1

Related Questions