Data structure for saving whole sentences in several languages

Question

I am planing to save big amount of sentences in a certain language such that every element has a pointer to the translation of that sentence to different languages.
By sorting the sentences I can create a balances binary tree and add/delete/search in O(logn).
Is there a way to save the elements in a hash table? What would the hash function look like in this situation?
I want to be able to search a sentence in any language and return the translation of it in any other language.
If I have K languages, using a balanced tree I get O(klogn) search time for k trees.
Is there a better way?

Aivean · Accepted Answer

Yes, trie is better suitable for string search. Also I'd store all sentences in all languages in single trie, with value being the list of all available translations.

The search in this case will be O(k) where k is the length of the searched sentence.

UPD If you are searching only for known language and translating to another known in advance language, then:

Make each the value of the trie Map<(Source Language) → Map<(Target language) → Translation>>

You'll have following workflow: Find value, that corresponds to the given sentence in the trie. In the found Map search for the source language. If found, this means that sentence is in target language (assuming there can be collisions between languages). At this point found Map (inner) represents all available translations to target languages. Search it for desired one.

The most performance-efficient way to implement both Maps will be plain array. Where i-th element represent value for i-th translation (assuming you have finite number of translations and you can assign distinct index to each of them). The search for known translation will be O(1) — just accessing known index in array. But this approach will also be most memory consuming, as if you have sparse values in the Map, your backing array will still consume fixed amount of memory (proportional to the number of all possible translations).

But this approach will work for the inner Map (where, as I understood, translation for all possible languages are stored).

For the outer map, which I can expect to be sparse, you have several options:

If the average number of collisions between sentences in different languages is small (1-2), you can just store Key→Value pairs in the list. Average search time will be constant.
If the average number of collisions is significant, but still not big enough to go with array (approach for the inner Map), then back this Map by a Hashtable, which size will be the expected number of values in the Map — the average number of collisions. Then again you'll get amortized O(1) search time.
Map appears to be dense — go with array approach, as for inner Map.

UPD2 Example:

Suppose you have next sentences that translates to each other: Hello [en] Hola [es] Hallo [nl] Hallo [de]

You should add all of them into trie, with following values:

Hello -> Map([en] -> Map([en]->Hello, [es]->Hola, [nl]->Hallo, [de]->Hallo))
Hola -> Map([es] -> Map([en]->Hello, [es]->Hola, [nl]->Hallo, [de]->Hallo))
Hallo -> Map([nl] -> Map([en]->Hello, [es]->Hola, [nl]->Hallo, [de]->Hallo),
             [de] -> Map([en]->Hello, [es]->Hola, [nl]->Hallo, [de]->Hallo))

Square brackets here have nothing to do with arrays, they are used to visually distinguish languages.

So, when you perform search in you trie, you don't know on which language is your sentence. When you get a value from the trie (Map of Maps) you can "map" the original language of your sentence to the translation in desired language.

Back to example, if you have sentence "Hallo" and you know it's in [de], you want to translate it into [en]. You search "Hallo" in the trie, and get:

Map([nl] -> Map([en]->Hello, [es]->Hola, [nl]->Hallo, [de]->Hallo),
             [de] -> Map([en]->Hello, [es]->Hola, [nl]->Hallo, [de]->Hallo))

Now you get value for [de] from this map (as you think your sentence in in [de]). You'll get:

Map([en]->Hello, [es]->Hola, [nl]->Hallo, [de]->Hallo)

Now fetch search for target language in this map ([en]). You'll get "Hello".

Of course instead of Trie->Map->Map structure you can have Map->Trie->Map, where first Map will map source languages to corresponding dictionaries, represented by trie. But I think this is less memory efficient, as it seems that words in different languages often share the same prefix, and hence having them in single trie will preserve memory. But it's the same in terms of performance efficiency, so it's up to you.

Data structure for saving whole sentences in several languages

Answers (1)

Related Questions