Reputation: 317
I am planing to save big amount of sentences in a certain language such that every element has a pointer to the translation of that sentence to different languages.
By sorting the sentences I can create a balances binary tree and add/delete/search in O(logn).
Is there a way to save the elements in a hash table? What would the hash function look like in this situation?
I want to be able to search a sentence in any language and return the translation of it in any other language.
If I have K languages, using a balanced tree I get O(klogn) search time for k trees.
Is there a better way?
Upvotes: 1
Views: 90
Reputation: 10882
Yes, trie is better suitable for string search. Also I'd store all sentences in all languages in single trie, with value being the list of all available translations.
The search in this case will be O(k) where k
is the length of the searched sentence.
UPD If you are searching only for known language and translating to another known in advance language, then:
Make each the value of the trie Map<(Source Language) → Map<(Target language) → Translation>>
You'll have following workflow: Find value, that corresponds to the given sentence in the trie. In the found Map search for the source language. If found, this means that sentence is in target language (assuming there can be collisions between languages). At this point found Map (inner) represents all available translations to target languages. Search it for desired one.
The most performance-efficient way to implement both Maps will be plain array. Where i-th
element represent value for i-th
translation (assuming you have finite number of translations and you can assign distinct index to each of them). The search for known translation will be O(1) — just accessing known index in array. But this approach will also be most memory consuming, as if you have sparse values in the Map, your backing array will still consume fixed amount of memory (proportional to the number of all possible translations).
But this approach will work for the inner Map (where, as I understood, translation for all possible languages are stored).
For the outer map, which I can expect to be sparse, you have several options:
UPD2 Example:
Suppose you have next sentences that translates to each other: Hello [en] Hola [es] Hallo [nl] Hallo [de]
You should add all of them into trie, with following values:
Hello -> Map([en] -> Map([en]->Hello, [es]->Hola, [nl]->Hallo, [de]->Hallo))
Hola -> Map([es] -> Map([en]->Hello, [es]->Hola, [nl]->Hallo, [de]->Hallo))
Hallo -> Map([nl] -> Map([en]->Hello, [es]->Hola, [nl]->Hallo, [de]->Hallo),
[de] -> Map([en]->Hello, [es]->Hola, [nl]->Hallo, [de]->Hallo))
Square brackets here have nothing to do with arrays, they are used to visually distinguish languages.
So, when you perform search in you trie, you don't know on which language is your sentence. When you get a value from the trie (Map of Maps) you can "map" the original language of your sentence to the translation in desired language.
Back to example, if you have sentence "Hallo" and you know it's in [de], you want to translate it into [en]. You search "Hallo" in the trie, and get:
Map([nl] -> Map([en]->Hello, [es]->Hola, [nl]->Hallo, [de]->Hallo),
[de] -> Map([en]->Hello, [es]->Hola, [nl]->Hallo, [de]->Hallo))
Now you get value for [de] from this map (as you think your sentence in in [de]). You'll get:
Map([en]->Hello, [es]->Hola, [nl]->Hallo, [de]->Hallo)
Now fetch search for target language in this map ([en]). You'll get "Hello".
Of course instead of Trie->Map->Map structure you can have Map->Trie->Map, where first Map will map source languages to corresponding dictionaries, represented by trie. But I think this is less memory efficient, as it seems that words in different languages often share the same prefix, and hence having them in single trie will preserve memory. But it's the same in terms of performance efficiency, so it's up to you.
Upvotes: 1