Jonas

Reputation: 159

machine-learning, artificial-intelligence and computational-linguistics

I would love to talk to people who have experience in machine learning, computational linguistics, or artificial intelligence in general, framed by the following example:

• Which existing software would you use for a manageable attempt at building something like Google Translate with statistical linguistics and machine learning? (Don't get me wrong, I don't actually want to build this; I'm only trying to sketch a conceptual framework for one of the most complex problems in this field. What would you reach for if you had the chance to lead a team realizing such a system?)

• Which existing database(s)? Which database technology would you use to store results when they amount to terabytes of data?

• Which programming languages besides C++?

• Apache Mahout?

• And, how would those software components work together to power the effort as a whole?

Upvotes: 0

Views: 2944

Answers (5)

PaoloTCS

Reputation: 21

Google's Tensorflow is a useful tool for basic translation. Anyone who is truly bilingual knows, however, that translating is not a statistical process. It is a much more complicated process that has just been simplified so that 90% of it seems correct.
Immense parallelism will make a great difference, so the advent of Quantum Computing, and maybe some of the ideas from it, may make the next 8% possible.
The final 2% will match normal professional translators and interpreters.

Upvotes: 0

Gael Varoquaux

Reputation: 2476

With regards to language choice, at least for prototyping, I would suggest Python. It is enjoying a lot of success in natural language processing, as it comes with a large library of tools for scientific computing, text analysis, and machine learning. Last but not least, it is really easy to call compiled code (C, C++) if you want to benefit from existing tools.

Specifically, have a look at modules such as scikits.learn.

Olivier Grisel's presentation on text mining with these tools can come in handy.
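
As a minimal sketch of that kind of text mining in Python (using the current scikit-learn API, the successor of scikits.learn; the two sample sentences are placeholders):

    # Turn raw text into a sparse term-count matrix with scikit-learn.
    from sklearn.feature_extraction.text import CountVectorizer

    docs = [
        "machine translation needs large parallel corpora",
        "statistical models estimate translation probabilities",
    ]

    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(docs)

    print(vectorizer.get_feature_names_out())  # learned vocabulary
    print(X.toarray())                         # word counts per document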

Disclaimer: I am one of the core developers of scikits.learn.

Upvotes: 3

yura

Reputation: 14645

• Which existing database(s)? Which database technology to store results when those are terabytes of data? HBase, ElasticSearch, MongoDB.

• Which programming languages besides C++? For ML, other popular languages are Scala, Java, and Python.

• Apache Mahout? Useful sometimes; in other cases it comes down to more coding against pure Hadoop.

• And, how would those software components work together to power the effort as a whole? There are many statistical machine learning algorithms that can be parallelized with MapReduce, with the results stored in NoSQL (see the sketch below).
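
A toy sketch of that map/reduce pattern in plain Python: counting word frequencies with a map step and a reduce step. The corpus and function names are invented for illustration; a real job would run on Hadoop over terabytes, not a local process pool.

    # Toy map/reduce: count word frequencies in parallel.
    from collections import Counter
    from multiprocessing import Pool

    corpus = [
        "the cat sat on the mat",
        "the dog chased the cat",
        "a cat and a dog",
    ]

    def map_counts(line):
        # map step: emit per-line word counts
        return Counter(line.split())

    def reduce_counts(partials):
        # reduce step: merge partial counts into one table
        total = Counter()
        for part in partials:
            total.update(part)
        return total

    if __name__ == "__main__":
        with Pool() as pool:
            partials = pool.map(map_counts, corpus)
        print(reduce_counts(partials).most_common(3))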

Upvotes: 2

bmat

Reputation: 26

The best techniques available for automated translation are based on statistical methods. In computer science this is known as "Machine Translation" or MT. The idea is to treat the text to be translated as a noisy signal and to use error correction to "fix" it. For example, suppose you are translating English to French. Assume the English statement was originally French but came out garbled as English; you have to correct it to restore the original. A statistical language model can be built for the target language (French) and for the errors. Errors could include dropped words, moved words, misspelled words, and added words.

More can be found at : http://www.statmt.org/
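
A toy rendering of that noisy-channel decision rule: to translate English e into French f, pick the f that maximizes P(f) * P(e | f). All the probabilities below are invented for the sketch; a real system estimates both models from large parallel corpora.

    candidates = ["le chat noir", "le noir chat", "chat le noir"]

    language_model = {      # P(f): how fluent the French candidate is
        "le chat noir": 0.60,
        "le noir chat": 0.10,
        "chat le noir": 0.05,
    }
    translation_model = {   # P(e | f): how well f explains "the black cat"
        "le chat noir": 0.50,
        "le noir chat": 0.50,
        "chat le noir": 0.40,
    }

    best = max(candidates, key=lambda f: language_model[f] * translation_model[f])
    print(best)  # -> "le chat noir"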

Regarding the db, an MT solution does not need a typical db. Everything should be done in memory.

The best language to use for this specific task is the fastest one. C would be ideal for this problem because it is fast and gives easy control over memory access. But any high-level language could be used, such as Perl, C#, Java, Python, etc.

Upvotes: 1

Kiril

Reputation: 40345

Which existing software would you use for a manageable attempt at building something like Google Translate with statistical linguistics and machine learning?

If your only goal is to build software that translates, then I would just use the Google Language API: it's free so why reinvent the wheel? If your goal is to build a translator similar to Google's for the sake of getting familiar with machine learning, then you're on the wrong path... try a simpler problem.

Which database(s)?

Update:
Depends on the size of your corpus: if it's ginormous, then I would go with hadoop (since you mentioned mahout)... otherwise go with a standard database (SQL Server, MySQL, etc.).

Original:
I'm not sure what databases you can use for this, but if all else fails you can use Google Translate to build your own database... however, the latter will introduce bias towards Google's translator and any errors that Google makes will cause your software to (at the very least) have the same errors.

Which programming languages besides C++?

Whatever you're most comfortable with... certainly C++ is an option, but you might have an easier time with Java or C#. Developing in Java and C# is much faster since there is A LOT of functionality built into those languages right from the start.

Apache Mahout?

If you have an enormous data set... you could.

Update:
In general if the size of your corpus is really big, then I would definitely use a robust combination like mahout/hadoop. Both of them are built exactly for that purpose and you would have a really hard time "duplicating" all of their work unless you do have a huge team behind you.

And, how would those software components work together to power the effort as a whole?

It seems that you are in fact trying to familiarize yourself with machine learning... I would try something MUCH simpler: build a language detector instead of a translator. I recently built one and I found that the most useful thing you can do is build character n-grams (bigrams and trigrams combined worked the best). You would then use the n-grams as input to a standard machine learning algorithm (like C4.5, GP, GA, a Bayesian model, etc.) and perform 10-fold cross-validation to minimize overfitting.
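
A minimal sketch of that recipe, with scikit-learn's multinomial naive Bayes standing in for C4.5 and a tiny invented corpus:

    # Character 2- and 3-gram features fed to a standard classifier,
    # scored with 10-fold cross-validation. Naive Bayes stands in for
    # C4.5 here; the corpus is made up and far too small to be real.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline
    from sklearn.model_selection import cross_val_score

    texts = ["the quick brown fox", "hello how are you", "good morning friend",
             "der schnelle braune fuchs", "hallo wie geht es dir", "guten morgen freund"] * 5
    labels = (["en"] * 3 + ["de"] * 3) * 5

    model = make_pipeline(
        CountVectorizer(analyzer="char", ngram_range=(2, 3)),
        MultinomialNB(),
    )

    scores = cross_val_score(model, texts, labels, cv=10)
    print(scores.mean())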


Update:

"...what software components do you use to make your example running?"

My example was pretty simple: I have an SQL Server database with documents that are already labeled with a language, I load all the data into memory (several hundred documents) and I give the algorithm (C4.5) each document. The algorithm uses a custom function to extract the document features (bigram and trigram letters), then it runs its standard learning process and spits out a model. I then test the model against a testing data set to verify the accuracy.

In your case, with terabytes of data, it seems that you should use mahout with hadoop. Additionally, the components you're going to be using are well defined in the mahout/hadoop architecture, so it should be pretty self explanatory from there on.

Upvotes: 3
