user1111426
user1111426

Reputation: 21

Find proximity matched pairs between two sets

Given two sets of keywords, where each keyword got start and end offset (e.g. keyword "abc" starts at offset 23 and ends at offset 25), I would like to efficiently find matching pairs between those sets. a matched pair is a keyword from set1 and a keyword from set2, where one keyword starts after the other keyword ends, but no more than MAX_PROXIMITY characters between the end of the one to the start of the other. in addition, each keyword can belong only to one pair (matched keyword cannot be reused for another match).

Upvotes: 2

Views: 240

Answers (2)

tkleczek
tkleczek

Reputation: 338

You can use dynamic programming to solve this problem.

Suppose you have each set of keywords ordered by the offset at which they start. Let's define

set1_keyword[i] # i-th keyword in the first set (ordered by the start offset)
set2_keyword[j] # same for the second set
max_pairs[i][j] # number of pairs in optimal assignment between keywords 1..i from the first set and keywords 1..j from the second set

Of course max_pairs[n1][n2] is the answer to your problem (wheren1 and n2 are the sizes of the first and second keyword set respectively)

To compute max_pairs table use the following formula

if (set1_keyword[i] matches set2_keyword[j])
    max_pairs[i][j] = max_pairs[i-1][j-1] + 1
else
    max_pairs[i][j] = max(max_pairs[i-1][j], max_pairs[i][j-1])

This basically says that if you can match the keywords you do it, because for the problem with keywords 1..i and 1..j it's the best you can do. In the other case (i-th and j-th keywords don't match) you cannot have a solution in which both i-th and j-th keywords are paired to some different keywords. So in the optimal solution either i-th keyword or j-th keyword should be unpaired. That basically tells us to look at the (already computed) solutions for problems max_pairs[i-1][j] (without i-th keyword) or max_pairs[i][j-1] (without j-th keyword) and choose the better of the two.

If you compute this table in the correct order i.e

for (int i = 0; i < n1; ++i)
    for (int j = 0; j < n2; ++j)
        # compute max_pairs[i][j] here

the algorithm will have O(n1*n2) complexity, which is better than the assignment problem in bipartite graph (which runs in O(n^3))

For more on dynamic programming please refer to dynamic programming

Upvotes: 0

FUD
FUD

Reputation: 5184

You could formulate it as maximum matching in a bipartite graph. Consider the two sets you have as two sets of vertices and generate edges between all the vertices from the first set to all the vertices in second set which satisfy your rule i.e. " where one keyword starts after the other keyword ends, but no more than MAX_PROXIMITY characters between the end of the one to the start of the other"

Once you have the graph in place run a maximum matching algorithm in a bipartite graph. http://en.wikipedia.org/wiki/Matching_(graph_theory)#Maximum_matchings_in_bipartite_graphs

Upvotes: 2

Related Questions