Reputation: 236
I am studying a problem that reduces to the cryptanalysis of a lengthy monoalphabetic substitution ciphertext written in a known language. This problem is easy enough to solve by hand using frequency analysis and word patterns, as described in Sinkov's Elementary Cryptanalysis. I'm having trouble finding a theoretically validated algorithm for it: Joux's Algorithmic Cryptanalysis doesn't even cover such rudimentary substitution, and I got nothing out of Gaines's Cryptanalysis: A Study of Ciphers and Their Solution (what other resources should I consult?).
Some approaches are pretty obvious. Deciding on each substitution in sequence and leveraging what is already known works only if no mistake is made along the way. Employing metaheuristic optimization -- for example, reassigning letters until the number of valid words found is maximized -- makes it hard to tell when the search is over. Perhaps a dynamic programming approach to testing the variations would be best. The answers to this question also suggest other, possibly naïve, methods.
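For concreteness, here is a rough sketch of the hill-climbing idea above in Python, using "number of valid words" as the objective. The word list, the swap move, and the restart/step budgets are all illustrative choices, not a recommendation:

```python
import random
import string

def decrypt(ciphertext, key):
    """Apply a candidate key: key[i] is the plaintext letter for the i-th ciphertext letter."""
    return ciphertext.translate(str.maketrans(string.ascii_lowercase, key))

def count_valid_words(text, dictionary):
    """Objective: number of whitespace-separated tokens that appear in the word list."""
    return sum(1 for w in text.split() if w in dictionary)

def hill_climb(ciphertext, dictionary, restarts=20, steps=2000):
    best_key, best_score = None, -1
    for _ in range(restarts):                        # random restarts to escape local optima
        key = list(string.ascii_lowercase)
        random.shuffle(key)
        score = count_valid_words(decrypt(ciphertext, ''.join(key)), dictionary)
        for _ in range(steps):
            i, j = random.sample(range(26), 2)       # propose swapping two assignments
            key[i], key[j] = key[j], key[i]
            new_score = count_valid_words(decrypt(ciphertext, ''.join(key)), dictionary)
            if new_score >= score:
                score = new_score                    # keep the swap
            else:
                key[i], key[j] = key[j], key[i]      # undo it
        if score > best_score:
            best_key, best_score = ''.join(key), score
    return best_key, best_score

# Usage (word list and ciphertext are placeholders):
# dictionary = set(open('words.txt').read().split())
# key, hits = hill_climb(ciphertext.lower(), dictionary)
```

As noted, a fixed step budget like this gives no principled stopping criterion -- which is exactly the difficulty with the metaheuristic route.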
What are the preferred algorithms for solving this kind of problem?
Upvotes: 1
Views: 675
Reputation: 8741
This paper describes an algorithm for solving homophonic substitution ciphers, of which simple substitution over a 26-letter alphabet (a-z) is a special case. The authors' reference implementation in C++ is available here.
The basic idea behind the algorithm -- a hill climb that scores candidate keys from digram frequency tables, so the ciphertext never has to be deciphered more than once -- comes from this paper, and the authors of the first paper describe it as the fastest known algorithm for solving monoalphabetic substitution ciphers.
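A minimal sketch of that digram trick (not the exact procedure from either paper): tabulate the ciphertext's digram counts once, then score every candidate key from that 26x26 table against a table of log digram probabilities for the language, so each candidate costs O(26^2) regardless of text length. The `log_digram_p` table is an assumed input, estimated from some reference corpus:

```python
import random
import string

A = string.ascii_lowercase

def digram_counts(text):
    """26x26 table of digram counts for the lowercased, letters-only text."""
    counts = [[0] * 26 for _ in range(26)]
    letters = [A.index(c) for c in text.lower() if c in A]
    for a, b in zip(letters, letters[1:]):
        counts[a][b] += 1
    return counts

def key_score(key, cipher_digrams, log_digram_p):
    """Score a key (ciphertext letter i decrypts to plaintext letter key[i]) using only
    the two 26x26 tables -- no pass over the ciphertext itself."""
    return sum(cipher_digrams[i][j] * log_digram_p[key[i]][key[j]]
               for i in range(26) for j in range(26))

def hill_climb(ciphertext, log_digram_p, steps=10000):
    cipher_digrams = digram_counts(ciphertext)       # the text is tabulated exactly once
    key = list(range(26))
    random.shuffle(key)
    best = key_score(key, cipher_digrams, log_digram_p)
    for _ in range(steps):
        i, j = random.sample(range(26), 2)           # propose swapping two plaintext letters
        key[i], key[j] = key[j], key[i]
        s = key_score(key, cipher_digrams, log_digram_p)
        if s > best:
            best = s
        else:
            key[i], key[j] = key[j], key[i]          # revert
    return ''.join(A[p] for p in key), best

# log_digram_p would be a 26x26 table of log digram probabilities for the language,
# estimated from a reference corpus (not included here).
```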
Upvotes: 1
Reputation: 51216
If we assume (unrealistically) that each letter in a given language appears with some known probability independently of nearby letters, then the exact maximum likelihood estimate can be found fairly efficiently. This is about the best you could hope for with this model.
Let's suppose there are n letters in the message and k letters in the alphabet.
The basic idea is that we somehow come up with a numeric score for each of the k^2 possible letter pairs, and then look for a maximum-weight bipartite perfect matching on the two alphabets -- that is, we choose pairs of letters in such a way that every letter in one alphabet is paired with exactly one letter in the other, and the sum of the scores of the chosen pairs is maximised. There are two main issues: (1) how to define the score of a given pair; and (2) how to solve the general problem of finding the highest-sum subset of pairs once we have decided on the individual scores.
The score of pairing the encoded letter x with the source letter y should be log(p(y)^freq(x)) = freq(x)*log(p(y)), where p(y) is the known probability of letter y in the language, and freq(x) is the number of times letter x appears in the encoded input. Why? Because under this simple model of the source language, which ignores the effect of nearby characters, the probability of generating any given n-character source-language string is the product of the probabilities of the letters that appear in it, which you can also compute as the product of p(y)^freq(y) over all letters y in the source alphabet. So to compute the probability of generating that string given that each x is "really" a y (and nothing else is a y), we change freq(y) to freq(x). Taking logs turns this product into the sum of log(p(y)^freq(x)) over all letters y in the source alphabet -- and this sum is exactly what the maximum-weight bipartite perfect matching algorithm will maximise.
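Building that score matrix is only a couple of lines once the ciphertext frequencies and the language's letter probabilities are in hand. A sketch, where `p` is a dict of source-language letter probabilities assumed to come from a reference corpus:

```python
import math
import string
from collections import Counter

A = string.ascii_lowercase

def score_matrix(ciphertext, p):
    """scores[x][y] = freq(x) * log(p(y)): the log-likelihood contribution of
    decoding ciphertext letter x as source letter y under the independence model."""
    freq = Counter(c for c in ciphertext.lower() if c in A)
    return [[freq[x] * math.log(p[y]) for y in A] for x in A]  # assumes every p[y] > 0
```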
There are algorithms that solve the closely related maximum-weight bipartite matching problem in O(V^2 E) time, which amounts to O(k^4) time here. That version of the problem maximises the total weight of the chosen pairs subject to the constraint that no item is matched with two or more items in the other set, but without requiring that every item be matched with something. Fortunately, we can turn such an algorithm into one that solves the "perfect" variation we want very simply, by adding a large number M to each score. Intuitively, this pushes the algorithm to include as many edges as possible (omitting even a single edge costs at least M, making such a solution worse than even a "stupid" solution that contains a full set of k edges), while still forcing it to choose the best among all the k-edge solutions (since every k-edge solution includes the same kM contribution from the added weight).
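In practice you can also hand the score matrix straight to an assignment-problem solver, which enforces the perfect matching directly, so the +M offset is only needed when your routine solves the non-perfect maximum-weight matching problem. A self-contained sketch using SciPy's linear_sum_assignment (again, the letter-probability dict `p` is an assumed input):

```python
import math
import string
from collections import Counter

import numpy as np
from scipy.optimize import linear_sum_assignment

A = string.ascii_lowercase

def maximum_likelihood_key(ciphertext, p):
    """Find the key maximising the likelihood under the independent-letters model,
    as a maximum-weight bipartite perfect matching between the two alphabets."""
    freq = Counter(c for c in ciphertext.lower() if c in A)
    scores = np.array([[freq[x] * math.log(p[y]) for y in A] for x in A])  # k x k weights
    # linear_sum_assignment solves the assignment problem, i.e. the perfect-matching
    # variant, directly, so no +M offset is needed here.
    rows, cols = linear_sum_assignment(scores, maximize=True)
    return {A[x]: A[y] for x, y in zip(rows, cols)}  # ciphertext letter -> source letter
```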
Upvotes: 2