Reputation: 1812
I am writing a script that will repeatedly search a large group of arrays (40,000 of them) and merge all of the arrays that have at least one common element. I have tried array_intersect(), but I found that it was too slow for this application. Is there another function that is faster and simply returns true as soon as at least one element is shared between two arrays?
It is my assumption that array_intersect() is slowed down by the fact that both arrays are scanned completely and the common values are grouped together and returned. It would be faster to exit as soon as a single match is found.
To clarify: all of the arrays are held within a single master array (a 2D array). If it is discovered that the arrays stored at $master[231] and $master[353] both contain the element 124352354, they should be merged into a new array and the result stored in a different 2D array designed to hold merged results.
Current code:
$test = array_intersect($entry, $entry2);
if($test){
...
}
A better method is:
$test = false;
foreach($entry as $item){
    if(in_array($item, $entry2)){
        $test = true;
        break;
    }
}
if($test){
...
}
and another improvement is using isset() with array_flip() instead of in_array():
$test = false;
$flipped = array_flip($entry2);
foreach($entry as $item){
    if(isset($flipped[$item])){
        $test = true;
        break;
    }
}
if($test){
...
}
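Extending the flip idea to the whole job: rather than comparing the 40,000 arrays pairwise, a single pass that indexes every value and merges overlapping groups with a small union-find might sidestep the intersect cost entirely. An untested sketch (all names here are my own):

```php
// Group arrays that share at least one element, scanning each value once.
// Assumes values are valid array keys (ints/strings). Uses a tiny
// union-find so that chains of overlaps (A~B, B~C) end up in one group.
function mergeGroups(array $master) {
    $parent = array();                 // group id => parent group id
    $find = function ($i) use (&$parent) {
        while ($parent[$i] !== $i) {
            $parent[$i] = $parent[$parent[$i]];  // path halving
            $i = $parent[$i];
        }
        return $i;
    };
    $owner = array();                  // value => group id that first saw it
    foreach ($master as $i => $entry) {
        $parent[$i] = $i;
        foreach ($entry as $value) {
            if (isset($owner[$value])) {
                $parent[$find($i)] = $find($owner[$value]);  // union
            } else {
                $owner[$value] = $i;
            }
        }
    }
    // Collect the merged arrays, de-duplicating values per group.
    $groups = array();
    foreach ($master as $i => $entry) {
        $root = $find($i);
        foreach ($entry as $value) {
            $groups[$root][$value] = $value;
        }
    }
    return array_map('array_values', $groups);
}
```

The point of the index is that every element is touched a constant number of times, instead of once per pair of arrays.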
Upvotes: 2
Views: 2568
Reputation: 32272
A self-join on the phone column will pair up accounts that share a number:
SELECT *
FROM contacts t1 INNER JOIN contacts t2
ON t1.phone = t2.phone
AND t1.AccountID < t2.AccountID
Also, if your system may ever grow to include international phone numbers, you should store them as a string type. Some countries, I believe in Europe, use leading zeroes in their phone numbers, and you cannot store those correctly in a numeric type.
The query below will return all instances of phone numbers used multiple times, with no duplicate rows, no matter how many accounts are sharing a phone number:
SELECT DISTINCT t1.AccountID, t1.phone
FROM contacts t1 INNER JOIN contacts t2
ON t1.phone = t2.phone
AND t1.AccountID != t2.AccountID
ORDER BY t1.phone
I'd include an SQLfiddle, but it seems to be broken at the moment. This is the schema/data I used as a test:
CREATE TABLE IF NOT EXISTS `contacts` (
  `AccountID` int(11) NOT NULL,
  `phone` varchar(32) NOT NULL,
  KEY `aid` (`AccountID`),
  KEY `phn` (`phone`)
);
INSERT INTO `contacts` (`AccountID`, `phone`) VALUES
(6, 'b'),
(1, 'b'),
(1, 'c'),
(2, 'd'),
(2, 'e'),
(3, 'f'),
(3, 'a'),
(4, 'a'),
(5, 'a'),
(1, 'a');
Upvotes: 0
Reputation: 1988
Busting-up pre-conceived notions all around
I both proved my point and undermined it in one answer. TL;DR: do this in SQL.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
PHP built-ins are cleaner at the code level, but surprisingly inefficient.
If you are going by key, your best bet is the tried-and-true for loop:
$array1 = array( ... );
$array2 = array( ... );
$match = FALSE;
foreach($array1 as $key => $value) {
    if(isset($array2[$key]) && $array2[$key] == $value) {
        $match = TRUE;
        break;
    }
}
It may seem counter-intuitive, but at the execution level, no matter what, you have to iterate over every item in at least one array. You can keep the cost down by iterating over the shorter of the two arrays. array_intersect() keeps going for every key in both arrays, so, although it looks crazy, you just have to do it the "dirty" way.
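The "iterate the shorter array" advice can be sketched like this (the function name is mine; it assumes the values are valid array keys):

```php
// Returns true as soon as the two arrays share a value.
// Flips the longer array once (cheap hash lookups), walks the shorter one.
function shareValue(array $a, array $b) {
    if (count($a) > count($b)) {
        list($a, $b) = array($b, $a);   // make $a the shorter array
    }
    $lookup = array_flip($b);
    foreach ($a as $value) {
        if (isset($lookup[$value])) {
            return true;                // early exit on first match
        }
    }
    return false;
}
```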
If the data comes from a database, it would actually be faster to have the SQL engine do the heavy lifting. A simple join with a LIMIT 1 will give you a flag telling you whether there are duplicates, and then you can execute another query to get the merged data (dynamically generate the query against multiple tables or source queries if you need to do this on more than one pair).
SQL will be orders of magnitude faster than a higher-level language like PHP for doing this. I don't care if you already have the arrays in memory; executing the query and loading the new array from the DB will be faster than trying to do the compare and then merge in resident memory of the app...
Again, with counter-intuitive things...
So, this is interesting:
I made a test script for this at http://pastebin.com/rzeQnyu2
With the matcher (phone number) at 5 digits, the foreach loop consistently executed in 1/100th the time of the other option. HOWEVER, up that to 10 digits (removing all possibility of collision) and the foreach loop jumps to 36x slower than the other option.
# 5 digit key (phone number)
Match found
For Each completed in 0.0001
Intersect Key completed in 0.0113
# 10 digit key
Match not found
For Each completed in 0.2342
Intersect Key completed in 0.0064
I wonder why the second option (which likely had larger arrays) was faster for Intersect than the smaller one... WEIRD...
This is because, while the intersect always iterates over all items, the foreach loop wins when it can exit early, but looks to be VERY slow if it doesn't get that opportunity. I would be interested in the deeper technical reasons for this.
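For anyone wanting to reproduce the comparison without the pastebin, a minimal timing harness along those lines (names and array shapes are my own, not the original script's):

```php
// Time an early-exit foreach against array_intersect_key on two
// key => value maps; returns the two elapsed times in seconds.
function benchPair(array $a, array $b) {
    $t = microtime(true);
    $match = false;
    foreach ($a as $k => $v) {
        if (isset($b[$k]) && $b[$k] == $v) {  // stop at the first hit
            $match = true;
            break;
        }
    }
    $foreachTime = microtime(true) - $t;

    $t = microtime(true);
    $match2 = (bool) array_intersect_key($a, $b);  // always scans everything
    $intersectTime = microtime(true) - $t;

    return array($foreachTime, $intersectTime);
}
```

Running it with and without a possible match is what exposes the crossover described above.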
Either way - in the end, just do it in SQL.
Upvotes: 1
Reputation: 10034
Assuming you just want to discover whether two arrays have a common element, you could create your own getIntersect function, which would be faster than array_intersect() since it returns immediately on the first match.
function getIntersect($arr1, $arr2)
{
    foreach($arr1 as $val1)
    {
        foreach($arr2 as $val2)
        {
            if($val1 == $val2)
            { return true; }
        }
    }
    return false;
}
Assuming instead that what you really want to find is arrays in which some element occurs more than once, you could easily have:
function hasCommonElements($arr)
{
    $n = count($arr); // cache the count: unset() below shrinks it
    for($i = 0; $i < $n; $i++)
    {
        $val = $arr[$i];
        unset($arr[$i]);
        if(in_array($val, $arr))
        {
            return true;
        }
    }
    return false;
}
And you could easily get an array of all arrays containing repeated elements using array_filter():
array_filter($my40k, "hasCommonElements");
Assuming that what you really want to do is find all arrays which have at least one value in common with another array in the set, you have to do a higher-level filter. Note that the callback needs access to the big array (via global here), and every array trivially intersects with itself, so that case is skipped:
$mybigarray; // your big array
function hasIntersects($arr)
{
    global $mybigarray; // the callback can't see local variables
    for($i = 0; $i < count($mybigarray); $i++)
    {
        if($arr === $mybigarray[$i])
        { continue; } // skip the self-match
        if(getIntersect($arr, $mybigarray[$i]))
        {
            return true;
        }
    }
    return false;
}
Then call our filter monster:
array_filter($mybigarray, "hasIntersects");
Disclaimer: None of this stuff is tested. Check for typos
Upvotes: 3
Reputation: 17028
If your arrays contain only values that are also valid keys (integers, strings, ...), it can be better to flip the arrays (swap keys and values), which effectively builds an index, and then search on the keys. Examples:
function haveCommonValues1($a1, $a2) {
    $a1_flipped = array_flip($a1);
    foreach ($a2 as $val) {
        if (isset($a1_flipped[$val])) {
            return true;
        }
    }
    return false;
}
or if you need the intersection:
function haveCommonValues2($a1, $a2) {
    $a1_flipped = array_flip($a1);
    $a2_flipped = array_flip($a2);
    return array_intersect_key($a1_flipped, $a2_flipped);
}
On some test arrays I got these results; this depends highly on the array structures, however, so you need to test it and compare the times yourself.
array_intersect : 0m1.175s
haveCommonValues1 : 0m0.454s
haveCommonValues2 : 0m0.492s
Upvotes: 2
Reputation: 1199
array_intersect() has a runtime of O(n log n) because it sorts the arrays before the comparison itself. Depending on your input you can improve on that in many different ways (e.g. if you have integers from a small range, you may implement the comparison using counting sort).
You can find an example of those optimizations right here or here. A possible solution where you won't need sorting is posted in this thread. It also runs in linear time, so I guess this is what you are looking for.
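A sketch of the linear-time, no-sorting idea for small non-negative integer ranges (my own illustration, not the linked code): mark which values occur in one array, then scan the other.

```php
// Linear-time common-element check for non-negative ints below $max.
// Uses a presence table instead of sorting: O(n + m) time, O($max) space.
function haveCommonInt(array $a, array $b, $max) {
    $seen = array_fill(0, $max, false);
    foreach ($a as $v) {
        $seen[$v] = true;             // mark every value in $a
    }
    foreach ($b as $v) {
        if ($seen[$v]) {
            return true;              // first shared value ends the scan
        }
    }
    return false;
}
```

The table only pays off when the value range is comparable to the array sizes; for sparse values, array_flip() plays the same role with hashing.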
Upvotes: 0