Keith
Keith

Reputation: 155872

Compress a short but repeating string

I'm working on a web app that needs to take a list of files on a query string (specifically a GET and not a POST), something like:

http://site.com/app?things=/stuff/things/item123,/stuff/things/item456,/stuff/things/item789

I want to shorten that string:

http://site.com/app?things=somekindofencoding

The string isn't terribly long, varies from 20-150 chars. Something that short isn't really suitable for GZip, but it does have an awful lot of repetition so compression should be possible.

I don't want a DB or Dictionary of strings - the URL will be built by a different application to the one that consumes it. I want a reversible compression that shortens this URL. It doesn't need to be secure.

Is there an existing way to do this? I'm working in C#/.Net but would be happy to adapt an algorithm from some other language/stack.

Upvotes: 2

Views: 423

Answers (2)

Mark Adler
Mark Adler

Reputation: 112547

You can try zlib using raw deflate (no zlib or gzip headers and trailers). It will generally provide some compression even on short strings that are composed of printable characters and does look for and take advantage of repeated strings. I haven't tried it, but could also see if smaz works for your data.

I would recommend obtaining a large set of real-life example URLs to use for benchmark testing of possible compression approaches.

Upvotes: 0

Rune FS
Rune FS

Reputation: 21752

If you can express the data in BNF you could contruct a parser for the data. in stead of sending the data you could send the AST where each node would be identified as one character (or several if you have a lot of different nodes). In your example

we could have

files : file files
      | 
file : path id
path : itemsthing
     | filesitem
     | stuffthingsitem

you could the represent a list of files as path[id1,id2,...,idn] using 0,1,2 for the paths and the input being:

/stuff/things/item123,/stuff/things/item456,/stuff/things/item789
/files/item1,/files/item46,/files/item7

you'd then end up with ?things=2[123,456,789]1[1,46,7]

where /stuff/things/item is represented with 2 and /files/item/ is represented with 1 each number within [...] is an id. so 2[123] would expand to /stuff/things/item123

EDIT The approach does not have to be static. If you have to discover the repeated items dynamically you can use the same approach and pass the map between identifier and token. in that case the above example would be

?things=2[123,456,789]1[1,46,7]&tokens=2=/stuff/things/,1=/files/item

which if the grammar is this simple ofcourse would do better with

?things=/stuff/things/[123,456,789]/files/item[1,46,7]

compressing the repeated part to less than the unique value with such a short string is possible but will most likely have to be based on constraining the possible values or risk actually increasing the size when "compressing"

Upvotes: 1

Related Questions