Mutuelinvestor
Mutuelinvestor

Reputation: 3538

I'm looking for an algorithm or function that can take a text string and convert it a number

I looking for a algorithm, function or technique that can take a string and convert it to a number. I would like the algorithm or function to have the following properties:

  1. Identical string yields the same calculated value
  2. Similar strings would yield similar values (similar can be defined as similar in meaning or similar in composition)

  3. Capable of handling strings of variable length

I read an article several years ago that gives me hope that this can be achieved. Unfortunately, I have been unable to recall the source of the article.

Upvotes: 2

Views: 489

Answers (5)

Ben Allison
Ben Allison

Reputation: 7394

I think you're probably after a hash function, as numerous posters have said. However, similar in meaning is also possible, after a fashion: use something like Latent Dirichlet Allocation or Latent Semantic Analysis to map your word into multidimensional space, relative to a model trained on a large collection of text (these pre-trained models can be downloaded if you don't have access to a representative sample of the kind of text you're interested in). If you need a scalar value rather than multi-dimensional vector (it's hard to tell, you don't say what you want it for) you could try a number of things like the probability of the most probable topic, the mean across the dimensions, the index of the most probable topic, etc. etc.

Upvotes: 2

Has QUIT--Anony-Mousse
Has QUIT--Anony-Mousse

Reputation: 77485

Non-serious answer: Map everything to 0

Property 1: check. Property 2: check. Property 3: check.

But I figure you want dissimilar strings to get different values, too. The question then is, what is similar and what is not.

Essentially, you are looking for a hash function.

There are a lot of hash functions designed with different objectives. Crypographic hashes for examples are pretty expensive to compute, because you want to make it really hard to go backwards or even predict how a change to the input affects the output. So they try really hard to violate your condition 2. There are also simpler hash functions that mostly try to spread the data. They mostly try to ensure that close input values are not close to each other afterwards (but it is okay if it is predictable).

You may want to read up on Wikipedia:

https://en.wikipedia.org/wiki/Hash_function#Finding_similar_substrings

(Yes, it has a section on "Finding similar substrings" via Hashing)

Wikipedia also has a list of hash functions:

https://en.wikipedia.org/wiki/List_of_hash_functions

There is a couple of related stuff for you. For example minhash could be used. Here is a minhash-inspired approach for you: Define a few random lists of all letters in your alphabet. Say I have the letters "abcde" only for this example. I'll only use two lists for this example. Then my lists are:

p1 = "abcde"
p2 = "edcba"

Let f1(str) be the index in p1 of the first letter in my test word, f2(str) the first letter in p2. So the word "bababa" would map to 0,3. The word "ababab" also. The word "dada" would make to 0,1, while "ce" maps to 2,0. Note that this map is invariant to word permutations (because it treats them as sets) and for long texts it will converge to "0,0". Yet with some fine tuning it can give you a pretty fast chance of finding candidates for closer inspection.

Upvotes: 3

Atilla Ozgur
Atilla Ozgur

Reputation: 14721

Fuzzy hashing (context triggered piecewise hashing) may be what you are looking for.

Implemenation: ssdeep

Explanation of the algorithm: Identifying almost identical files using context triggered piecewise hashing

Upvotes: 2

Jonathon Ashworth
Jonathon Ashworth

Reputation: 1114

Similar in composition is pretty easy, I'll let somebody else tackle that.

Similar in meaning is a lot harder, but fun :), I remember reading an article about how a neural network was trained to construct a 2D "semantic meaning graph" of a whole bunch of english words, where the distance between two words represented how "similar" they are in meaning, just by training it on wikipedia articles.

You could do the same thing, but make it one-dimensional, that will give you a single continuous number, where similar words will be close to each other.

Upvotes: 3

goat
goat

Reputation: 31834

num = 0
for (byte in getBytes(str))
    num += UnsignedIntValue(byte)

This would meet all 3 properties(for #2, this works on the strings binary composition).

Upvotes: 0

Related Questions