kc.kc.kc

Reputation: 67

Python: What is the best way to determine the language of a text based on each letter's frequency?

I am writing a function in Python 2 that returns the language of a string based on its letter frequencies.

I am using the table named "Relative frequencies of letters in other languages" from Wikipedia. (https://en.wikipedia.org/wiki/Letter_frequency)

I have already computed the relative frequency of each letter in the given text and stored it in a dictionary, where each value is the number of occurrences of the key divided by the total number of letters:

{'a': 0.2, 'b': 0.05, 'c': 0.01, ...} 
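
For reference, a dictionary like this could be built roughly as follows. This is only a minimal sketch: the function name letter_frequencies is hypothetical, and it assumes that case is ignored and that everything other than letters (digits, spaces, punctuation) is skipped.

```python
from collections import Counter

def letter_frequencies(text):
    """Return {letter: share of all letters} for the letters found in text."""
    # Assumption: lowercase everything and keep alphabetic characters only.
    letters = [ch for ch in text.lower() if ch.isalpha()]
    total = float(len(letters))  # float() keeps the division exact on Python 2
    if total == 0:
        return {}
    counts = Counter(letters)
    return dict((letter, count / total) for letter, count in counts.items())
```

On Python 2, accented letters are only handled reliably if the text is passed as a unicode string.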

I have also converted the table into a dictionary of dictionaries:

{'a': {'English': 0.08167, 'French': 0.07363, ...}, 'b': {'English': 0.01492, 'French': 0.00901, ...}, ...}

What are some good ways to compare these values to determine the language from the frequencies?

Solved - here's the updated code:

# freq_reference is a dictionary with structure {'English': {'a': freq, 'b': freq, ...}, 'French': {'a': freq, 'b': freq, ...}}
# freq is a dictionary with key = letter, and value = frequency of the letter that appears in the input text

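# Manhattan distance: sum of the absolute differences per letter
dis_man = {}
for lang in freq_reference:
    dis_man[lang] = 0.0
    for key in freq_reference[lang]:
        # a letter that never appears in the input text counts as frequency 0
        dis_man[lang] += abs(freq_reference[lang][key] - freq.get(key, 0.0))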

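# Euclidean distance: square root of the summed squared differences per letter
dis_euc = {}
for lang in freq_reference:
    total = 0.0  # named total so the built-in sum() is not shadowed
    for key in freq_reference[lang]:
        # a letter that never appears in the input text counts as frequency 0
        total += (freq_reference[lang][key] - freq.get(key, 0.0))**2
    dis_euc[lang] = total**(1/2.0)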

# find the lang with minimum Manhattan dis
min_lang_man = min(dis_man, key=dis_man.get)
min_man = dis_man[min_lang_man]

# find the lang with minimum Euclidean dis
min_lang_euc = min(dis_euc, key=dis_euc.get)
min_euc = dis_euc[min_lang_euc]

Upvotes: 1

Views: 632

Answers (1)

Code-Apprentice

Reputation: 83537

I think a dictionary structured as {'English': {'a': ..., 'b': ..., ... }, 'French': {...}, ...} makes more sense for two reasons:

  1. Looking up a language immediately gives you a dictionary with exactly the same structure as the frequency dictionary for your sample text.

  2. Each language can have its own set of characters.
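
If the table is already stored per letter, it can be flipped into this per-language layout with a short loop. The following is a sketch only; the function name by_language is hypothetical and not part of the question's code.

```python
def by_language(table_by_letter):
    """Turn {letter: {language: freq}} into {language: {letter: freq}}."""
    table_by_language = {}
    for letter, per_language in table_by_letter.items():
        for language, value in per_language.items():
            # Create the inner dictionary the first time a language is seen.
            table_by_language.setdefault(language, {})[letter] = value
    return table_by_language
```

A language that has no entry for a given letter simply ends up without that key, which is how each language keeps its own character set.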

Once you do this, a good place to start is to calculate the "distance" between your sample frequencies and the reference frequencies of each language. There are several distance metrics, including Manhattan distance and Euclidean distance. Try more than one of them so that you have multiple data points for measuring "closeness".
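
As a rough illustration of the idea, the two metrics and the final lookup could be sketched like this. The function names are hypothetical, and treating a letter that is missing from either dictionary as frequency 0 is an assumption of this sketch.

```python
def manhattan(sample, reference):
    """Sum of absolute per-letter differences."""
    # Assumption: a letter missing from either side has frequency 0.
    letters = set(sample) | set(reference)
    return sum(abs(sample.get(ch, 0.0) - reference.get(ch, 0.0)) for ch in letters)

def euclidean(sample, reference):
    """Square root of the summed squared per-letter differences."""
    letters = set(sample) | set(reference)
    return sum((sample.get(ch, 0.0) - reference.get(ch, 0.0)) ** 2
               for ch in letters) ** 0.5

def closest_language(sample, freq_reference, distance=manhattan):
    """Return the language whose reference frequencies are nearest to sample."""
    return min(freq_reference,
               key=lambda lang: distance(sample, freq_reference[lang]))
```

Calling closest_language(freq, freq_reference, euclidean) as well as the default Manhattan version gives two independent guesses. If they disagree, the sample is probably too short or too ambiguous to classify from letter frequencies alone.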

Upvotes: 1
