Devendra Wani
Devendra Wani

Reputation: 121

Understanding implementation of DC3/Skew algorithm to create Suffix Array linear time

I am trying to understand implementation of linear time suffix array creation algorithm by Karkkainen, P. Sanders. Details of algorithm can be found here.

I managed to understand overall concept but failing to match it with provided implementation and hence not able to grasp it clearly.

Here are initial code paths which are confusing me.

As per paper : n0, n1, n2 represent number of triplets starting at i mod 3 = (0,1,2)

As per code : n0 = (n + 2) / 3, n1 = (n + 1) / 3, n2 = n / 3; => How these initialisations has been derived?

As per paper : We need to create T` which is concatenation of triplets at i mod 3 != 0

As per code : n02 = n0 + n2; s12 = [n02] ==> How came n02? It should be n12 i.e n1 + n2.

As per code : for (int i = 0, j = 0; i < n + (n0 - n1); i++) fill s12 with triplets such that i%3 != 0; => Why for loop runs for n + (n0 - n1) times ? It should be simply n1 + n2. Should't be ?

I am not able to proceed because of these :( Please to help.

Upvotes: 3

Views: 3337

Answers (1)

ezorita
ezorita

Reputation: 126

Consider the following example where the length of the input is n=13:

 STA | CKO | WER | FLO | W

As per code : n0 = (n + 2) / 3, n1 = (n + 1) / 3, n2 = n / 3; => How these initialisations has been derived?

Note that the number of triplets i mod3 = 0 is n/3 if n mod3 = 0 and n/3+1 otherwise (if n mod3 = 1 or n mod3 = 2). In the current example n/3 = 4 but since the last triplet 'W' is not complete it is not counted in the integer division. A 'trick' to make this computation directly is to use (n+2)/3. Effectively, if n mod3 = 0 then the result of the integer divisions (n+2)/3 and n/3 will be the same. However, if n mod3 = 1 or 2 then the result of (n+2)/3 will be n/3+1. The same applies to n1 and n2.

As per code : n02 = n0 + n2; s12 = [n02] ==> How came n02? It should be n12 i.e n1 + n2. As per code : for (int i = 0, j = 0; i < n + (n0 - n1); i++) fill s12 with triplets such that i%3 != 0; => Why for loop runs for n + (n0 - n1) times ? It should be simply n1 + n2. Should't be ?

Both questions have the same answer. In our example we'd have a B12 buffer like this:

B12 = B1 U B2 = {TA KO ER LO}

So you'd first sort the suffixes and end up with a suffix array of B12, which has 8 elements. To proceed to the merging step we first need to compute the suffix array of B0, which is obtained by sorting the tuples (B0(i),rank(i+1))... But this concrete case in which the last triplet has only one element (W) has a problem, because rank(i+1) is not defined for the last element of B0:

B0 = {0,3,6,9,12}

which sorted alphabetically results in

SA0 = {3, 9, 0, ?, ?}

Since the indices 6 and 12 contain a 'W', it is not enough to sort alphabetically, we need to check which goes first in the rank table, so let's check the rank of their suffixes.. oh, wait! rank(13) is not defined!

And that's why we add a dummy 0 to the last triplet of the input when the last triplet only contains one element (if n mod3 = 0). So then the size of B12 is n0+n2, no matter the size of n1, and one needs to add an extra element to B12 if B0 is larger than B1 (in which case n0-n1 = 1).

Hope it was clear.

Upvotes: 2

Related Questions