Reputation: 57184
I was just kicking around the idea of boiling a large block of text down to a single integer by using recursive 2-gram storage until there is only one value left.
-- one self-referencing table: each row is either a word or a pair of lower-level IDs
CREATE TABLE pair (
    id               INTEGER PRIMARY KEY,
    first_parent_id  INTEGER REFERENCES pair (id),  -- points to -> this.id
    second_parent_id INTEGER REFERENCES pair (id)   -- points to -> this.id
);
For example, in the following code I have an 11-word sentence (twelve tokens with the period). I could store each word pair in a database ("this" + "is" = ID #1), then store each set of two word pairs in the database (1 + 2 = ID #7), and repeat until I get down to only one set left, which would be ID #12.
This is my group of words which I plan to compress.
---1---|--2-----|--3-----|-----4-|----5--|-------6-
-------7--------|--------8-------|-------9---------
----------------10---------------11----------------
------------------------12-------------------------
Then, using the number "12", we can work backwards (if we have the same dataset):
------------------------12-------------------------
----------------10---------------11----------------
-------7--------|--------8-------|-------9---------
---1---|--2-----|--3-----|-----4-|----5--|-------6-
This is my group of words which I plan to compress.
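To make the idea concrete, here is a rough Python sketch of what I have in mind (the names intern, compress, and decompress are just illustrative, and I carry an odd leftover item up to the next level unchanged):

def intern(key, ids, defs):
    # Return the ID already assigned to `key`, or create the next one.
    if key not in ids:
        ids[key] = len(defs)
        defs.append(key)
    return ids[key]

def compress(words, ids, defs):
    # Start from one ID per word, then pair adjacent IDs level by level.
    level = [intern(("word", w), ids, defs) for w in words]
    while len(level) > 1:
        nxt = [intern(("pair", level[i], level[i + 1]), ids, defs)
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])   # odd leftover carries up unchanged
        level = nxt
    return level[0]                 # the single remaining ID

def decompress(root, defs):
    # Walk back down the tree from the single ID to the word list.
    entry = defs[root]
    if entry[0] == "word":
        return [entry[1]]
    return decompress(entry[1], defs) + decompress(entry[2], defs)

ids, defs = {}, []
words = "This is my group of words which I plan to compress .".split()
top = compress(words, ids, defs)
assert decompress(top, defs) == words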
While this would take a tremendous amount of work to compress/uncompress each string, it seems like it might have a use in some kind of archival work where the contents need to be stored but are never read except in rare cases, so the slow uncompression process isn't a problem.
Am I thinking about this correctly? Would the possible number of word sequences just be too great to store like this? (Imagine a 500-word document.)
Upvotes: 2
Views: 1452
Reputation: 1371
Why do you need "word digrams" to achieve compression? If that's not a strict requirement, there are various methods to compress textual data for different scenarios. These are mostly called dictionary preprocessing. Here is a list of approaches that can be applied in your case:
Count word occurrences and sort them by frequency in descending order. You can use the top N words with your custom encoding method, where N can be configurable by the user. You can even optimize N with dynamic programming or similar. During actual encoding, emit a flag to indicate whether the next symbol is a dictionary word or a directly encoded word.
Build a histogram of digram or trigram character combinations (including spaces, punctuation, etc.). Then use unused byte values to encode the digrams or trigrams that are seen most frequently (a rough sketch follows this list). You can even apply the method recursively, scanning over and over again, to shrink the source file further.
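For the digram idea, here is a rough Python sketch (an illustrative toy, not production code, and it assumes the input doesn't already use all 256 byte values; digram_encode/digram_decode are made-up names):

from collections import Counter

def digram_encode(data):
    # Map the most frequent two-byte combinations to byte values that
    # never occur in the input, then substitute them greedily.
    used = set(data)
    free = [b for b in range(256) if b not in used]
    counts = Counter(data[i:i + 2] for i in range(len(data) - 1))
    table = {pair: code
             for code, (pair, _) in zip(free, counts.most_common(len(free)))}
    out, i = bytearray(), 0
    while i < len(data):
        pair = data[i:i + 2]
        if pair in table:
            out.append(table[pair])
            i += 2
        else:
            out.append(data[i])
            i += 1
    return bytes(out), table

def digram_decode(encoded, table):
    # Expand each substituted byte back into its original two-byte pair.
    rev = {code: pair for pair, code in table.items()}
    out = bytearray()
    for b in encoded:
        out.extend(rev.get(b, bytes([b])))
    return bytes(out)

text = b"this is my group of words which i plan to compress."
packed, table = digram_encode(text)
assert digram_decode(packed, table) == text

Note that the decoder needs the substitution table (or a shared, fixed one), which is the same dictionary-size trade-off discussed below.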
Compared to the methods above, your scheme is inefficient: it seems you didn't consider that you would need a really big dataset just to decode your encoded data. To understand most compression ideas, it's better to write a very simple test program and analyze its output. You'll end up with a stronger and more stable algorithm eventually.
Here are some dictionary preprocessors that come to mind, just to give you a reference:
Upvotes: 2
Reputation: 5674
In short, yes, the possible number of sequences would likely be too great to do this efficiently. The bigger problem is that those word mappings, and the n-grams built on top of each of those mappings, would need to be stored somewhere, and that storage would greatly outweigh any savings from the actual "compression."
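As a rough back-of-the-envelope illustration: reducing a 500-word document to one ID requires about 499 pair records, because each pairing step merges two entries into one, so going from 500 entries down to 1 takes 499 merges, on top of the 500 word entries themselves. Unless identical subtrees repeat across documents, which becomes very unlikely at the higher levels, the dictionary grows at least as fast as the text it is supposed to compress.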
Upvotes: 1