Reputation: 61

Segmentation of strings using dynamic programming

Suppose you have a function quality(x) that returns the quality of a sequence of letters x. Given a string such as "howareyoutoday," what is the most efficient way to determine that the segmentation is "how are you today" (i.e. quality(how)+quality(are)+quality(you)+quality(today) is the maximum quality possible)?

I was thinking that we could have something like the following:

A[0] = h, A[1] = o, ..., A[n] = y 
Q[0] = quality(A[0]), Q[1] = quality(A[0]A[1]), ..., Q[n] = quality(A[0]...A[n])

Now to determine the segmentation, we find max{Q[0], .., Q[n]} which will return some Q[i] (the first space is after this). Then, we find max{Q[i+1], .. Q[n]} which returns another Q[i] (second space is after this), etc. until max returns Q[n].

I have two questions: does this method use dynamic programming? It seems to me that it does, since we build the initial Q with subproblems to the original problem. Also, is this an optimal solution? To my understanding, the worst case would be O(n^2), which would be when max returns Q[0], then Q[1], then Q[2], etc.

Upvotes: 1

Answers (2)

j_random_hacker

Reputation: 51226

One way to formulate this as a DP problem is to compute a function f(i):

f(i) = The highest sum of quality scores achievable on the first i characters.

What we notice is that the last segment must end at position i-1, and it could begin anywhere before it, right back to position 0 (assuming we number the positions starting at 0). Let's call the starting position of this segment j; we'll need to try all possible values of j, and find the one that's best. What should we combine this final segment A[j]..A[i-1] with? With the optimal solution to whatever remains to its left, which is conveniently given by f(j). So we get:

f(i) = max { f(j) + quality(A[j]..A[i-1]) } over all 0 <= j <= i.

with f(0) = 0 as a boundary case.

f(n+1) will calculate the optimal score for the entire string in quadratic time and using linear memory requirements. To actually calculate where the word breaks should appear to get this score, you need to walk backwards from f(n+1), determining which j value caused it to achieve its maximum (there may be several; in that case, any will work). Keep repeating this until you get j = 0. (Alternatively you can store these "maximising" values of j in a separate array as you compute f(). This array is often called "pred[]", for "predecessor".)

[EDIT]

The above is a recursion to solve the problem. It will solve it correctly, but very slowly for large n. All DP algorithms can be stated this way, but to actually turn it from a plain recursion into a DP algorithm you need to either memoize it, or figure out how to make a bottom-up table-filling iterative algorithm from it. Memoization of a recursion is trivial: all you need to do is, whenever the function computes the answer to a fresh value of i, store this answer (here, an array of size n+1 will do), and use it next time that particular value of f(i) is asked for. Bottom-up DP involves figuring out an order in which the table can be populated so that all dependencies are met. This can be (a constant factor) faster, but is sometimes harder to do.

Upvotes: 2

Zim-Zam O'Pootertoot

Reputation: 18148

You can use dynamic programming if you can make use of cached results of sub-problems. For example, if quality(A[0]A[1]) is a function of quality(A[0]) and quality(A[1]) (e.g. quality(A[0]A[1]) = quality(A[0]) + quality(A[1])) then you can benefit by caching the results of the sub-problems; if quality(A[0]A[1]) is independent of quality(A[0]) and quality(A[1]) then you won't benefit by caching the results of the sub-problems and you can't use dynamic programming.

The worst case of your algorithm is O(n^2). You might be able to make use of a potentially sub-optimal O(n) solution: test quality(A[0]), then quality(A[0]A[1]), then quality(A[0]A[1]A[2]), and so on; halt when the quality of the current solution is worse than the quality of the previous solution, so if quality(A[0]A[1]) > quality(A[0]A[1]A[2]) then partition at A[0]A[1] and begin the search for the next word at A[2]. You can improve the quality of the algorithm at the cost of a larger constant factor by increasing the lookahead; the algorithm just described has a lookahead(1), whereas a lookahead(2) algorithm would partition at A[0]A[1] if quality(A[0]A[1]) > quality(A[0]A[1]A[2]) and quality(A[0]A[1]) > quality(A[0]A[1]A[2]A[3]).

Upvotes: 1

Segmentation of strings using dynamic programming

Answers (2)

Related Questions