Reputation: 31

What are the best algorithms to determine the language of text and to correct typos in python?

I am looking for algorithms that could tell the language of the text to me(e.g. Hello - English, Bonjour - French, Servicio - Spanish) and also correct typos of the words in english. I have already explored Google's TextBlob, it is very relevant but it got "Too many requests" error as soon as my code starts executing. I also started exploring Polyglot but I am facing a lot of issues to download the library on Windows.

Code for TextBlob

*import pandas as pd
from tkinter import filedialog
from textblob import TextBlob
import time
from time import sleep
colnames = ['Word']
x=filedialog.askopenfilename(title='Select the word list')
print("Data to be checked: " + x)
df = pd.read_excel(x,sheet_name='Sheet1',header=0,names=colnames,na_values='?',dtype=str)
words = df['Word']
i=0
Language_detector=pd.DataFrame(columns=['Word','Language','corrected_word','translated_word'])
for word in words:

        b = TextBlob(word)
        language_word=b.detect_language()
        time.sleep(0.5)

        if language_word in ['en','EN']:
            corrected_word=b.correct()
            time.sleep(0.5)
            Language_detector.loc[i, ['corrected_word']]=corrected_word
        else:
             translated_word=b.translate(to='en')
             time.sleep(0.5)

        Language_detector.loc[i, ['Word']]=word
        Language_detector.loc[i, ['Language']]=language_word
        Language_detector.loc[i, ['translated_word']]=translated_word

        i=i+1

filename="Language detector test v 1.xlsx"
Language_detector.to_excel(filename,sheet_name='Sheet1')
print("Languages identified for the word list")**

Upvotes: 1

Answers (3)

Mario

Reputation: 2621

For typo correction, you could use the Levenshtein algorithm, which computes a «edit distance». You can compare your words against a dictionary and choose the most likely word. For Python, you could use: https://pypi.org/project/python-Levenshtein/

See the concept of Levenshtein edit distance here: https://en.wikipedia.org/wiki/Levenshtein_distance

Upvotes: 0

Raymond Hettinger

Reputation: 226346

A common way to classify languages is to gather summary statistics on letter or word frequencies and compare them to a known corpus. A naive bayesian classifier would suffice. See https://pypi.org/project/Reverend/ for a way to do this in Python.

Correction of typos can also be done from a corpus using a statistical model of the most likely words versus the likelihood of a particular typo. See, https://norvig.com/spell-correct.html for an example of how to do this in Python.

Upvotes: 1

CLpragmatics

Reputation: 645

You could use this, but it is hardly reliable:

https://github.com/hb20007/hands-on-nltk-tutorial/blob/master/8-1-The-langdetect-and-langid-Libraries.ipynb

Alternatively, you could give compact language detector (cld v3) or fasttext a chance OR you could use a corpus to check frequencies of occurring words with the target text in order to find out whether the target text belongs to the language of the respective corpus. The latter is only possible if you know the set of languages to choose from.

Upvotes: 0

What are the best algorithms to determine the language of text and to correct typos in python?

Answers (3)

Related Questions