MicSim
MicSim

Reputation: 26806

Map strings to numbers maintaining the lexicographic ordering

I'm looking for an algorithm or function that is able to map a string to a number in such way that the resulting values correspond the lexicographic ordering of strings. Example:

"book" -> 50000
"car"  -> 60000
"card" -> 65000
"a longer string" -> 15000
"another long string" -> 15500
"awesome" -> 16000

As a function it should be something like: f(x) = y, so that for any x1 < x2 => f(x1) < f(x2), where x is an arbitrary string and y is a number.

If the input set of x is finite, then I could always do a sort and assign the proper values, but I'm looking for something generic for an unlimited input set for x.

Upvotes: 2

Views: 3556

Answers (6)

Konstantin Burlachenko
Konstantin Burlachenko

Reputation: 5685

I have post a question here https://stackoverflow.com/questions/22798824/what-lexicographic-order-means As workaround you can append empty symbols with code zero to right side of the string, and use expansion from case II.

Without such expansion with extra empty symbols I' m actually don't know how to make such mapping.... But if you have a finite set of Symbols (V), then |V*| is eqiualent to |N| -- fact from Disrete Math.

Upvotes: 0

Geert Baeyaert
Geert Baeyaert

Reputation: 303

what you are asking for is a a temporary suspension of the pigeon hole principle (http://en.wikipedia.org/wiki/Pigeonhole_principle).

The strings are the pigeons, the numbers are the holes. There are more pigeons than holes, so you can't put each pigeon in its own hole.

Upvotes: 2

Pillsy
Pillsy

Reputation: 9901

As a corollary to Jason's answer, if you can map your strings to rational numbers, such a mapping is very straightforward. If code(c) is the ASCII code of the character c and s[i] is theith character in the string s, just sum like follows:

result <- 0
scale  <- 1
for i from 1 to length(s) 
  scale <- scale / 26
  index <- (1 + code(s[i]) - code('a'))
  result <- result + index / scale
end for
return result

This maps the empty string to 0, and every other string to a rational number between 0 and 1, maintaining lexicographical order. If you have arbitrary-precision decimal floating-point numbers, you can replace the division by powers of 26 with powers of 100 and still have exactly representable numbers; with arbitrary precision binary floating-point numbers, you can divide by powers of 32.

Upvotes: 3

Robert S. Barnes
Robert S. Barnes

Reputation: 40578

Maybe a Radix Tree is what you're looking for?

A radix tree, Patricia trie/tree, or crit bit tree is a specialized set data structure based on the trie that is used to store a set of strings. In contrast with a regular trie, the edges of a Patricia trie are labelled with sequences of characters rather than with single characters. These can be strings of characters, bit strings such as integers or IP addresses, or generally arbitrary sequences of objects in lexicographical order. Sometimes the names radix tree and crit bit tree are only applied to trees storing integers and Patricia trie is retained for more general inputs, but the structure works the same way in all cases.

LWN.net also has an article describing this data structures use in the Linux kernel.

Upvotes: 1

Hamish Grubijan
Hamish Grubijan

Reputation: 10830

You would be much better off writing a comparator which you can supply to a sort function. The comparator takes two strings and returns -1, 0, or 1. Even if you could create such a map, you still have to sort on it. If you need both a "hash" and the order, then keep stuff in two data structures - one that preserves the order, and one that allows fast access.

Upvotes: 1

jason
jason

Reputation: 241701

If you require that f map to integers this is impossible.

Suppose that there is such a map f. Consider the strings a, aa, aaa, etc. Consider the values f(a), f(aa), f(aaa), etc. As we require that f(a) < f(aa) < f(aaa) < ... we see that f(a_n) tends to infinity as n tends to infinity; here I am using the obvious notation that a_n is the character a repeated n times. Now consider the string b. We require that f(a_n) < f(b) for all n. But f(b) is some finite integer and we just showed that f(a_n) goes to infinity. We have a contradiction. No such map is possible.

Maybe you could tell us what you need this for? This is fairly abstract and we might be able to suggest something more suitable. Further, don't necessarily worry about solving "it" generally. YAGNI and all that.

Upvotes: 19

Related Questions