Reputation: 525
I'm wondering how I can generate a random logical word list in PHP.
I have a MySQL database full of English words (A - Z) and I want to generate logical words to go with each one.
For example: in my word list, number 26 is 'abandon'. I would like to generate a word for it, maybe using regex or something, so I can translate a whole page of words back and forth.
The problem with using straight-up random words is that they don't look authentic enough, so 'abandon' might become (purely randomly generated) 'qdbskp' or something like that. It really just looks like someone slammed their face into the keyboard.
Instead, I would like some logic to it, so maybe a few vowels and consonants to make the word look "real".
Hopefully I'm explaining myself correctly.
Thanks.
TL;DR: I'm trying to create a dictionary of randomly generated words, each linked to a word in an English word list, with enough logic behind them that the generated words look real.
Upvotes: 2
Views: 4885
Reputation: 1365
function random_word( $length = 6 ) {
    // Consonants and clusters that may appear anywhere, including the start of a word.
    $cons = array( 'b', 'c', 'd', 'f', 'g', 'h', 'j', 'k', 'l', 'm', 'n', 'p', 'r', 's', 't', 'v', 'w', 'x', 'z', 'pt', 'gl', 'gr', 'ch', 'ph', 'ps', 'sh', 'st', 'th', 'wh' );
    // Clusters that should not start a word; they are merged in once the word is 2 characters long.
    $cons_cant_start = array( 'ck', 'cm', 'dr', 'ds','ft', 'gh', 'gn', 'kr', 'ks', 'ls', 'lt', 'lr', 'mp', 'mt', 'ms', 'ng', 'ns','rd', 'rg', 'rs', 'rt', 'ss', 'ts', 'tch');
    $vows = array( 'a', 'e', 'i', 'o', 'u', 'y','ee', 'oa', 'oo');
    // Start with either a consonant or a vowel, then alternate.
    $current = ( mt_rand( 0, 1 ) == 0 ? 'cons' : 'vows' );
    $word = '';
    while( strlen( $word ) < $length ) {
        if( strlen( $word ) == 2 ) $cons = array_merge( $cons, $cons_cant_start );
        // $current holds the name of the array to pick from ('cons' or 'vows').
        $rnd = ${$current}[ mt_rand( 0, count( ${$current} ) -1 ) ];
        // Only append the piece if it still fits within $length; otherwise pick again.
        if( strlen( $word . $rnd ) <= $length ) {
            $word .= $rnd;
            $current = ( $current == 'cons' ? 'vows' : 'cons' );
        }
    }
    return $word;
}
Simple and it's working great, credits to http://ozh.in/vh
Upvotes: 1
Reputation: 26375
What can make a word look somewhat logical is if it's composed of characters in an order you're used to seeing them. One way to do this is with a weighted list of trigrams - sequences of 3 characters.
Basically you take any two letters, like "so", and add another that commonly comes after it, like "l". Then take the last two letters, "ol", and find what comes after that. Rinse/repeat until you've got a word of whatever length you'd like - "solverom".
Sourcing from Peter Norvig's n-gram data (which itself was compiled from the Google Books Ngrams), I've put together a few JSON files on GitHub. I'd include the data directly here, but trigrams.json in particular is a bit big for that at ~128KB.
The data can actually be compiled from any dictionary or other hulking word list, and is structured like so...
[0,26,622,4615,6977,10541,13341,14392,13284,11079,8468,5769,3700,2272,1202,668,283,158,64,40,16,1,5,2]
This one is complete. It is a (0-indexed) distribution of lengths of distinct words. Each index is the word length and each value how many words of that length were found. So, for example, there were 4615 distinct words that were 3 characters long.
We'll use this to decide how long our new word should be. Basically we add up all the values, pick a random number between 1 and the total, then find where in the set it lies. The key for that element is how long the word will be.
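As a rough illustration of that step, here is a minimal sketch using plain integers, which is enough for this particular list because its total is small. The function name pick_weighted_index is made up for this example and is not part of the original code.

```php
<?php
// Hypothetical helper (not from the original answer): pick an index from a
// 0-indexed list of counts, with probability proportional to each count.
function pick_weighted_index(array $counts): int {
    $total = array_sum($counts);
    if ($total < 1) {
        throw new InvalidArgumentException('Counts must sum to at least 1.');
    }
    // Random position between 1 and the total, inclusive.
    $position = mt_rand(1, $total);
    foreach ($counts as $index => $count) {
        // Walk through the list until the position falls inside an element.
        $position -= $count;
        if ($position <= 0) {
            return $index;
        }
    }
    // Not reachable when all counts are non-negative integers.
    return count($counts) - 1;
}

// The distribution shown above: index = word length, value = number of words.
$lengths = [0,26,622,4615,6977,10541,13341,14392,13284,11079,8468,5769,3700,2272,1202,668,283,158,64,40,16,1,5,2];

echo pick_weighted_index($lengths), "\n";
```

Because index 0 has a count of 0, a length of 0 can never be picked.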
{
"TH": "82191954206",
"HE": "9112438473",
"IN": "27799770674",
"ER": "324230831",
...
This one couples bigrams, two-character combinations, with how often they're found at the beginning of words. Yes, everything is in capital letters.
We'll use this to decide what to start our word with.
{
"TH": {
"E": "69221160871",
"A": "9447439870",
"I": "6357454845",
"O": "3369505315",
"R": "1673179164",
...
},
"AN": {
"D": "26468697834",
"T": "3755591976",
"C": "3061152975",
...
This one is a little more interesting. Each key in this data set is a bigram, mapped to the characters that can follow it and how often each one does.
"D" shows up after "AN" a lot.
This is what we'll use to build up the rest of the word.
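Since the data can be compiled from any word list, here is a minimal sketch of how the three structures might be built. The function name build_ngram_data and the four sample words are assumptions for illustration only. Unlike the JSON files described above, this counts each distinct word once rather than weighting by real-world frequency, and it stores counts as integers rather than strings.

```php
<?php
// Hypothetical helper (not from the original answer): build word lengths,
// word-start bigrams and trigrams from an array of words.
function build_ngram_data(array $words): array {
    $lengths = [];
    $bigrams = [];
    $trigrams = [];
    foreach (array_unique(array_map('strtoupper', $words)) as $word) {
        // Skip anything that is not purely A-Z.
        if (!preg_match('/^[A-Z]+$/', $word)) {
            continue;
        }
        $len = strlen($word);
        $lengths[$len] = ($lengths[$len] ?? 0) + 1;
        if ($len >= 2) {
            // First two characters of the word.
            $start = substr($word, 0, 2);
            $bigrams[$start] = ($bigrams[$start] ?? 0) + 1;
        }
        // Every pair of characters, keyed to the character that follows it.
        for ($i = 0; $i + 2 < $len; $i++) {
            $pair = substr($word, $i, 2);
            $next = $word[$i + 2];
            $trigrams[$pair][$next] = ($trigrams[$pair][$next] ?? 0) + 1;
        }
    }
    // Fill missing lengths with 0 so the list is 0-indexed without gaps.
    $max = $lengths ? max(array_keys($lengths)) : 0;
    $filled = [];
    for ($i = 0; $i <= $max; $i++) {
        $filled[$i] = $lengths[$i] ?? 0;
    }
    return ['lengths' => $filled, 'bigrams' => $bigrams, 'trigrams' => $trigrams];
}

$data = build_ngram_data(['the', 'then', 'and', 'hand']);
print_r($data['trigrams']);
```

With a word list this small the generator would quickly hit bigrams that have no follower, so a real run would need a much larger list.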
First we need a few utility functions.
function gmp_rand($min, $max) {
    $max -= $min;
    $bit_length = strlen(gmp_strval($max, 2));
    do {
        $rand = gmp_init(0);
        for ($i = $bit_length - 1; $i >= 0; $i--) {
            gmp_setbit($rand, $i, rand(0, 1));
            if ($rand > $max) break;
        }
    } while ($rand > $max);
    return $rand + $min;
}
Because some of the numbers we need to generate can be larger than PHP_INT_MAX, we'll use the PHP GMP extension to deal with them. It's a simple enough rand() work-alike.
function array_weighted_rand ($list) {
    $total_weight = gmp_init(0);
    foreach ($list as $weight) {
        $total_weight += $weight;
    }
    $rand = gmp_rand(1, $total_weight);
    foreach ($list as $key => $weight) {
        $rand -= $weight;
        if ($rand <= 0) return $key;
    }
}
This is much like the built-in array_rand() in that you pass it an array and it'll return a random key. Only this one factors in the weight when picking it.
So if you pass in an array that looks like:
array (
'foo' => 2,
'bar' => 4,
'baz' => 12
)
It'll return bar about twice as often as it'll return foo, and baz about three times as often as bar.
function fill_word ($word, $length, $trigrams) {
    while (strlen($word) < $length) {
        $word .= array_weighted_rand($trigrams[substr($word, -2)]);
    }
    return $word;
}
This takes a string $word and fills it to $length from the set of given $trigrams. Each iteration it picks from the data set based on the last two characters in the string.
$lengths = json_decode(file_get_contents('distinct_word_lengths.json'), true);
$bigrams = json_decode(file_get_contents('word_start_bigrams.json'), true);
$trigrams = json_decode(file_get_contents('trigrams.json'), true);
for ($i = 0; $i < 10; $i++) {
    do {
        $length = array_weighted_rand($lengths);
        $start = array_weighted_rand($bigrams);
        $word = fill_word($start, $length, $trigrams);
    } while (!preg_match('/[AEIOUY]/', $word));
    $word = strtolower($word);
    echo "$word\n";
}
What we're doing is getting a random length and a random bigram to begin the word with, then filling it up. The preg_match() is just to validate that the word contains a vowel, which isn't otherwise guaranteed. If it doesn't, try again.
You can replace this with any sort of validation you might want to do, such as making sure it doesn't match a real word in your database or whatever.
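As one possible shape for such a check, here is a minimal sketch. The function name is_acceptable_word and the three "real" words are assumptions for illustration; in practice the set would be loaded from your own word list or database.

```php
<?php
// Hypothetical validator (not from the original answer): a candidate passes
// if it contains a vowel and is not already a known real word.
// $realWordSet is assumed to have lowercase words as its keys.
function is_acceptable_word(string $candidate, array $realWordSet): bool {
    $candidate = strtolower($candidate);
    if (!preg_match('/[aeiouy]/', $candidate)) {
        return false;
    }
    return !isset($realWordSet[$candidate]);
}

// Flip once so each lookup is a key check instead of a scan of the list.
$realWordSet = array_flip(['abandon', 'ist', 'perma']);

var_dump(is_acceptable_word('THEMBE', $realWordSet));  // bool(true)
var_dump(is_acceptable_word('ABANDON', $realWordSet)); // bool(false): real word
var_dump(is_acceptable_word('PSST', $realWordSet));    // bool(false): no vowel
```

The loop condition would then become something like while (!is_acceptable_word($word, $realWordSet)).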
Yeah, you might generate a real word. Just pronounce it different if you want to say you made it up.
Running a handful of times landed me with these:
ancover ingennized plesuri asymbablew
orkno oftedi nestrat arlysect
welvency thembe therespaid frokedgerition
judeth ist rectede privede
aprommautu offeleal townerislo callynerly
thentsi perma themenum agesputherflone
pecticangenti whoult ifileyea onster
flatco powne prative betion
inegansith meraddin theste mysistai
skerest uppre ongdonc hadmints
All of which my spell-checker hates.
Full data and code can be grabbed from GitHub.
Upvotes: 15
Reputation: 525
I've made a lot of progress using many of the ideas suggested and have come up with a rather interesting system to generate words from their English equivalents. I have made a function that builds a word out of chunks, each made of 0 to 3 random consonants followed by a vowel.
function generateRandomWord($length = false) {
    $vowels = "aeiou";
    $consonants = "bcdfghjklmnpqrstvwxyz";
    $string = "";
    // No length given: pick a short one at random.
    if ($length == false) {
        $length = rand(1, 3);
    }
    // Build $length chunks, each 0 to 3 consonants followed by one vowel.
    for ($i = 0; $i < $length; $i++) {
        $ratio = rand(0, 3);
        for ($a = 0; $a < $ratio; $a++) {
            $string .= $consonants[rand(0, strlen($consonants) - 1)];
        }
        $string .= $vowels[rand(0, strlen($vowels) - 1)];
    }
    // The chunks usually overshoot, so cut the result down to $length characters.
    if (strlen($string) > $length) {
        $string = substr($string, 0, $length);
    }
    return $string;
}
It also trims off the end of the string so the word is not too long.
After pressing refresh a couple of times, I get this:
aa ri
aah oeb
aal gyi
aalii cpwaa
aardvark qdiaieug
aardvarks jupuhuafs
aardwolf yaniruqk
aardwolves qtxikicoes
aargh yauka
aarrghh byifqsa
I found this to be quite interesting and I can populate a database of these generated words with their English translation.
This could make a pretty cool secret language which can be translated back and forth.
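A minimal sketch of the back-and-forth translation, assuming the generated pairs have already been loaded into an array. The name translate_text is made up for this example, and the three pairs are copied from the sample output above. Translating back only works if every generated word is unique and none of them collides with an untranslated English word, so duplicates would need to be regenerated before saving.

```php
<?php
// English => generated word (pairs taken from the sample output above).
$toFake = ['aardvark' => 'qdiaieug', 'aargh' => 'yauka', 'aa' => 'ri'];

// array_flip only gives a usable reverse map if every generated word is unique.
if (count(array_unique($toFake)) !== count($toFake)) {
    throw new RuntimeException('Duplicate generated words; regenerate them first.');
}
$toEnglish = array_flip($toFake);

// Hypothetical helper: replace every word found in $map, leave the rest alone.
// Matching is case-insensitive and the output is lowercase.
function translate_text(string $text, array $map): string {
    return preg_replace_callback('/[a-z]+/i', function ($match) use ($map) {
        $key = strtolower($match[0]);
        return $map[$key] ?? $match[0];
    }, $text);
}

$secret = translate_text('aargh, an aardvark!', $toFake);
echo $secret, "\n";                               // yauka, an qdiaieug!
echo translate_text($secret, $toEnglish), "\n";   // aargh, an aardvark!
```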
Upvotes: 1
Reputation: 209585
One option is to have a list of valid syllables and then simply combine these randomly, or keyed off the real word that you are using as the basis for the fake word (by some sort of mapping of real syllables to fake). If coming up with a list of valid syllables is too much work, or produces bad results, you could go to the next level: phonotactics. You'd have to develop a system that can concatenate sounds in a way that doesn't violate the rules of English. For example, it's okay to start a word with "bl" followed by a vowel, but not "bn" followed by a vowel (so you can have "black" but not *"bnack"). These rules probably can't all be expressed as "letter x can/cannot be followed by letter y", but most can, and perhaps that would be good enough for generating random fake, but plausible-sounding words.
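A minimal sketch of the syllable idea, assuming a syllable is an onset, a vowel sound and a coda. The three lists are deliberately tiny, hand-picked assumptions rather than a complete description of English, and the function names are made up for this example. Because "bn" is not in the onset list, no word can start with it, while "bl" can.

```php
<?php
// Hypothetical, hand-picked lists; a real system would need far more entries.
$onsets = ['', 'b', 'bl', 'br', 'd', 'dr', 'f', 'fl', 'g', 'gl', 'k', 'l',
           'm', 'n', 'p', 'pl', 'r', 's', 'st', 'str', 't', 'tr'];
$nuclei = ['a', 'e', 'i', 'o', 'u', 'ee', 'oo'];
$codas  = ['', 'n', 'm', 'l', 'r', 's', 't', 'nd', 'nt', 'st', 'ck'];

// Build one syllable as onset + vowel sound + coda.
function random_syllable(array $onsets, array $nuclei, array $codas): string {
    return $onsets[array_rand($onsets)]
        . $nuclei[array_rand($nuclei)]
        . $codas[array_rand($codas)];
}

// Join a few syllables into one fake word.
function syllable_word(int $count, array $onsets, array $nuclei, array $codas): string {
    $word = '';
    for ($i = 0; $i < $count; $i++) {
        $word .= random_syllable($onsets, $nuclei, $codas);
    }
    return $word;
}

for ($i = 0; $i < 5; $i++) {
    echo syllable_word(mt_rand(1, 3), $onsets, $nuclei, $codas), "\n";
}
```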
Upvotes: 0
Reputation: 5919
Other than the comment I gave above, if you specifically wanted nonsense words, but still plausible, probably the easiest way is this:
find 2 words which have a number (what number that is may require experimentation) of letters in common (not at the start or end), and combine them - the start of one, and the end of the other.
For example, if you combine "experimENTation" and "mENThol", you would get "experimENThol". You should check the dictionary before using them (if they HAVE to be nonsense), or you might accidentally make a real word - e.g. combining "mENThol" and "experimENTation", you would get "mENTation" - which is a real word.
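A minimal sketch of that blending step, assuming a fixed overlap of 3 letters. The function name blend_words and the default overlap are assumptions for this example; the right number of shared letters may take some experimentation.

```php
<?php
// Hypothetical helper (not from the original answer): join the start of
// $first to the end of $second at a shared run of $overlap letters that is
// not at the very start or very end of either word. Returns null if no
// such run exists.
function blend_words(string $first, string $second, int $overlap = 3): ?string {
    $first = strtolower($first);
    $second = strtolower($second);
    // Candidate runs in $first: skip position 0 and any run touching the end.
    for ($i = 1; $i + $overlap < strlen($first); $i++) {
        $run = substr($first, $i, $overlap);
        // Look for the same run in $second, starting after its first letter.
        $j = strpos($second, $run, 1);
        if ($j !== false && $j + $overlap < strlen($second)) {
            return substr($first, 0, $i) . substr($second, $j);
        }
    }
    return null;
}

echo blend_words('experimentation', 'menthol'), "\n"; // experimenthol
echo blend_words('menthol', 'experimentation'), "\n"; // mentation
```

The second result is exactly the case described above, so each blend would still need to be checked against the dictionary if it has to be nonsense.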
Upvotes: 0