Reputation: 3978
I have two sets of data in this form:
x | y | z x1 | y1 | z1
ab1 | 1 | 2 ab1 | 1 | 2
ab1 | 2 | 3 ab1 | 1.8 | 2
ab2 | 2 | 3 ab1 | 1.8 | 2
The number of columns can change between 1 to 30. The number of rows of the two sets is likely to be different. The average amount of rows per set can change between few hundreds to few millions. For each column a different matching rule will be applied, for example:
x: perfect match
y: +/- 0.1
z: +/- 0.5
Two rows are equivalent when all the criterias are satisfied. My final goal is to find the rows in the first set with no match in second set.
The naive algorithm could be:
foreach a in SetA
{
foreach b in SetB
{
if (a == b)
{
remove b from SetB
process the next element in SetA
}
}
log a is not in SetB
}
At this stage I am not very interested in the efficiency of the algorithm. I am sure I could do better and I could reduce the complexity. I am more concern about the correctness of the result. Let's try with a very simple example. Two sets of number:
A B
1.6 1.55
1.5 1.45
4 3.2
And two elements are equal if:
b + 0.1 >= a >= b - 0.1
Now, if I run the naive algorithm I will find 2 matches. However the result of the algorithm depends on the order of the two sets. For example:
A B
1.5 1.55
1.6 1.45
4 3.2
The algorithm will find only one match.
I would like to find the maximum number of matching rows.
I reckon in the real world data one of the columns will store an id, so the number of possible multiple matches will be a much smaller subset of the original set. I know I can try to face this problem with a post processing after the first scan. However, I don't want reinventing the wheel and I am wondering if my problem is equivalent to some famous, well known and already solved problem.
PS: I have tagged the question also as C++, C# and Java because I am going to use one of these languages to implement it.
Upvotes: 0
Views: 550
Reputation: 66981
Two rows are equivalent when all the criteria are satisfied. My final goal is to find the rows in the first set with no match in second set.
foreach a in SetA
{
foreach b in SetB
{
if (a == b) //why would you alter SetB at all
go to next A
}
remove a from SetA //log a is not in SetB
}
However, you are right, that this is equivalent to some famous, well known and already solved problem
. It's called "Set Difference". It's... kind of a major part of set theory. And since all those languages have sets, they also have that algorithm. C++ even has a dedicated function for it. Approximate Complexity of all of these is O(2(A+B)-1)
.
C++ standard algorithm function: http://www.cplusplus.com/reference/algorithm/set_difference/
vector<row> Result(A.rows());
end = std::set_difference(A.begin(), A.end(),
B.begin(), B.end(),
Result.begin());
Result.resize(end-Result.begin());
or std::unordered_set can be made to do this: http://msdn.microsoft.com/en-us/library/bb982739.aspx
std::unordered_set<row> Result(A.begin(), A.end());
for(auto i=B.begin(); i!=B.end(); ++i) {
auto f = Result.find(*i);
if (f != A.end())
A.erase(f);
}
Java does as well: http://download.oracle.com/javase/tutorial/collections/interfaces/set.html
Set<row> Result = new Set<row>(A);
A.removeAll(B);
And C#: http://msdn.microsoft.com/en-us/library/bb299875.aspx
HashSet<row> Result = new HashSet<row>(A);
A.ExceptWith(B);
Upvotes: 0
Reputation: 201
As I understand, you want the rows in the first set which don't match any row in the second set (within the error range). This cleaerly can be done with an O(n^2) complexity algorithm by parsing the elements in the first set and comparing them with the elements in the second set. An optimization could be this:
sort both the sets - O(n*ln(n))
eliminate from the start the elements too small or too big (within the error) from the first set - O(n)
look in the second set for elements from the first set using a binary search (within the error) - O(n*lg(2)) and eliminate those not suitable
comlexity O(n*ln(n))
Upvotes: 1
Reputation: 1356
Given your constraints, I dont see a way to do it in less than O(n^2). I'd probably modify your naive algorithm to include either a bool or a count field for each row in table A and then mark it if it matches a row in table B. Then post process it with std::partition based on the indicator to group all the unique and non unique rows together. If you go with a count, you could get the rows that were "least unique". The bool would be somewhat more efficient since you could break out of the loop over B at the first match.
Upvotes: 0
Reputation: 944
From the statement "My final goal is to find the rows in the first set with no match in second set." I understand that there can be multiple rows in first set that match the same row in the second set. In this case the solution is to remove the line "remove b from SetB" from your naive algorithm.
If however, you really need one to one matches between elements of the two sets then the answer with "maximum-bipartite-matching" provide by Corey Kosak applies.
Upvotes: 0
Reputation: 2625
It can be cast as a graph theory problem. Let X be a set that contains one node for each row in your first set. Let Y be another set which contains one node for each row in your second set.
The edges in the graph are defined by: for a given x in X and a given y in Y, there is an edge (x,y) if the row corresponding to x matches the row corresponding to y.
Once you have built this graph you can run the "maximum-bipartite-matching" algorithm on it and you will be done.
Upvotes: 1
Reputation: 3400
range tree? http://en.wikipedia.org/wiki/Range_tree i dont really know, just throwing ideas out there
Upvotes: 0