Reputation: 108
The Re-Pair algorithm compresses text by repeatedly replacing the most frequent pair of adjacent symbols with a new symbol. The algorithm gets very slow on large texts, and I was wondering how to parallelize it with multiple threads to improve compression speed.
Pseudocode for Re-Pair compression of ASCII text:
rules = {}
next_pair_symbol = 256
while true:
    pair_counts = {}                               // reset counts on every pass
    for (i = 0 .. text.length() - 2)
        pair_counts[text.substring(i, i + 2)]++    // the pair at positions i, i+1
    most_frequent_pair = pair_counts.getPairWithMaxCount()
    if (pair_counts[most_frequent_pair] < 2)
        break
    rules[next_pair_symbol] = most_frequent_pair
    text.replace(most_frequent_pair, next_pair_symbol++)
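To make the pseudocode concrete, here is a minimal runnable sketch of it (my own Python translation, not a reference implementation). It works on a list of integer codes so that rule symbols can simply be integers >= 256, and it replaces non-overlapping occurrences left to right:

```python
from collections import Counter

def repair_compress(text):
    """Sequential Re-Pair: repeatedly replace the most frequent
    adjacent pair with a fresh symbol (integer codes >= 256)."""
    symbols = [ord(c) for c in text]   # work on a list of int codes
    rules = {}                         # new_symbol -> (left, right)
    next_symbol = 256
    while len(symbols) >= 2:
        # Count every adjacent pair in the current sequence.
        counts = Counter(zip(symbols, symbols[1:]))
        pair, count = counts.most_common(1)[0]
        if count < 2:
            break
        rules[next_symbol] = pair
        # Replace non-overlapping occurrences, scanning left to right.
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(next_symbol)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
        next_symbol += 1
    return symbols, rules

def repair_decompress(symbols, rules):
    """Expand the rules recursively to recover the original text."""
    def expand(s):
        if s < 256:
            return chr(s)
        left, right = rules[s]
        return expand(left) + expand(right)
    return ''.join(expand(s) for s in symbols)
```

Note this naive version re-counts all pairs on every pass; the published Re-Pair algorithm updates counts incrementally, which is where most of the implementation effort goes.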
The final value of the variable rules is later used to recover the original text from the compressed text.
Before spending time coding, I'm wondering about the correct way to implement multithreading with Re-Pair. One possible solution is to split the given text into N slices, where N is the number of threads; then run N threads in parallel, each on its own slice, producing slice-specific rules[]. When all threads have terminated, merge the rules[] from each thread and apply the merged rules to the given text to get the final compressed string.
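A sketch of that slice-and-merge idea (my own illustration; names like compress_slice and parallel_repair are hypothetical). One subtlety: each slice allocates its own rule symbols starting at 256, so the merge step must renumber them into a shared symbol space (deduplicating identical pairs along the way). ThreadPoolExecutor is used for brevity; in CPython the GIL means a process pool would be needed for a real CPU speedup:

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def compress_slice(text):
    """Per-slice Re-Pair; local rule symbols start at 256 in each slice."""
    symbols = [ord(c) for c in text]
    rules = {}
    nxt = 256
    while len(symbols) >= 2:
        counts = Counter(zip(symbols, symbols[1:]))
        pair, n = counts.most_common(1)[0]
        if n < 2:
            break
        rules[nxt] = pair
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(nxt)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
        nxt += 1
    return symbols, rules

def parallel_repair(text, n_workers=4):
    """Compress each slice independently, then renumber every slice's
    local rules into one global symbol space so they do not clash."""
    size = max(1, -(-len(text) // n_workers))        # ceiling division
    slices = [text[i:i + size] for i in range(0, len(text), size)]
    with ThreadPoolExecutor(max_workers=n_workers) as ex:
        results = list(ex.map(compress_slice, slices))
    merged_rules, merged_symbols = {}, []
    pair_to_symbol = {}          # dedupe identical pairs across slices
    next_global = 256
    for symbols, rules in results:
        remap = {}               # local symbol -> global symbol
        for local in sorted(rules):          # locals were created in order
            l, r = rules[local]
            pair = (remap.get(l, l), remap.get(r, r))
            if pair in pair_to_symbol:
                remap[local] = pair_to_symbol[pair]
            else:
                remap[local] = pair_to_symbol[pair] = next_global
                merged_rules[next_global] = pair
                next_global += 1
        merged_symbols.extend(remap.get(s, s) for s in symbols)
    return merged_symbols, merged_rules
```

Besides the statistics issue, note that pairs straddling slice boundaries are never counted at all in this scheme.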
The compression will not be optimal if the slices have different statistical distributions. Are there better alternatives that are less sensitive to the distribution of characters in the given text?
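One variant I can think of that sidesteps the per-slice statistics problem entirely (my own sketch, not from any reference): keep the replacement step global and sequential, so the chosen pair is always the globally most frequent one, and parallelize only the counting pass by splitting it into index ranges and merging the per-range Counters. A pair straddling a chunk boundary is counted exactly once, by the chunk owning the pair's first index. The output then matches the single-threaded loop up to tie-breaking among equally frequent pairs; as before, CPython's GIL means a real speedup would need processes or a language with true parallel threads:

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_range(symbols, lo, hi):
    """Count pairs whose FIRST element index lies in [lo, hi); the
    slice symbols[lo+1:hi+1] supplies the partner element, so boundary
    pairs are counted once, by the chunk owning their first index."""
    return Counter(zip(symbols[lo:hi], symbols[lo + 1:hi + 1]))

def repair_parallel_counting(text, n_workers=4):
    """Global Re-Pair loop with only the counting pass parallelized."""
    symbols = [ord(c) for c in text]
    rules = {}
    next_symbol = 256
    with ThreadPoolExecutor(max_workers=n_workers) as ex:
        while len(symbols) >= 2:
            step = max(1, -(-len(symbols) // n_workers))   # ceiling division
            ranges = [(lo, min(lo + step, len(symbols)))
                      for lo in range(0, len(symbols), step)]
            counts = Counter()
            for partial in ex.map(lambda r: count_range(symbols, *r), ranges):
                counts.update(partial)
            pair, count = counts.most_common(1)[0]
            if count < 2:
                break
            rules[next_symbol] = pair
            # Replacement stays sequential and global.
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                    out.append(next_symbol)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            symbols = out
            next_symbol += 1
    return symbols, rules
```

The trade-off: only the counting pass scales with the worker count, while the replacement pass remains a sequential bottleneck, so by Amdahl's law the overall speedup is bounded.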
Upvotes: 2
Views: 65