Alon Gubkin

Reputation: 57119

How to detect the language of a string?

What's the best way to detect the language of a string?

Upvotes: 21

Views: 34265

Answers (11)

Russ Cam

Reputation: 125488

Use SearchPioneer.Lingua, a .NET port of the popular Lingua library:

using Lingua;
using static Lingua.Language;

var detector = LanguageDetectorBuilder
    .FromLanguages(English, French, German, Spanish)
    .Build();

var detectedLanguage = detector.DetectLanguageOf("languages are awesome");
Assert.Equal(English, detectedLanguage);

It supports 79 languages, and there are

  • extensive accuracy tests (click on Accuracy Report) that show it to be more accurate than other available libraries
  • benchmarks that show it to generally be the fastest in low accuracy mode.

(If you're curious, you can clone the repo and easily run any of these yourself.)

You can find more details in the introduction blog post for the library.

Full disclosure: I ported the library to .NET and wrote the introduction blog post.

Upvotes: 0

Alexander Gluschenko

Reputation: 11

I have a Panlingo project that I made half a year ago for my own purposes. There are six wrappers for native binaries of CLD2, CLD3, FastText, MediaPipe, Whatlang & Lingua. Most of the packages run on Linux, Windows and macOS.

Some of these models are based on neural networks and some on classical N-grams and statistics. The results of the models are not identical and may be inaccurate on very short texts.

Supported languages:

  • CLD2: 83
  • CLD3: 107
  • Whatlang: 69
  • MediaPipe: 110
  • Lingua: 75
  • FastText: depends on model

Example for FastText:

using System;
using Panlingo.LanguageIdentification.FastText;

class Program
{
    static void Main()
    {
        using var fastText = new FastTextDetector();
        fastText.LoadDefaultModel();

        var predictions = fastText.Predict(
            text: "Привіт, як справи?", 
            count: 10
        );

        foreach (var prediction in predictions)
        {
            Console.WriteLine($"{prediction.Label}: {prediction.Probability}");
        }
    }
}

Example for MediaPipe:

using System;
using Panlingo.LanguageIdentification.MediaPipe;

class Program
{
    static void Main()
    {
        using var mediaPipe = new MediaPipeDetector(
            options: MediaPipeOptions.FromDefault()
        );

        var text = "Привіт, як справи?";

        var predictions = mediaPipe.PredictLanguages(text);

        foreach (var prediction in predictions)
        {
            Console.WriteLine(
                $"Language: {prediction.Language}, " +
                $"Probability: {prediction.Probability}"
            );
        }
    }
}

Upvotes: 1

f3lix

Reputation: 29879

CLD3 (Compact Language Detector v3) library from Google's Chromium browser

You could wrap the CLD3 library, which is written in C++.

Upvotes: 4

NGambit

Reputation: 1181

One alternative is to use the Translator Text API, which is

... part of the Azure Cognitive Services API collection of machine learning and AI algorithms in the cloud, and is readily consumable in your development projects

Here's a quickstart guide on how to detect the language of a text using this API.
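
For a rough idea of what the call looks like, here is a minimal sketch against the Translator v3 REST "detect" operation. It assumes System.Text.Json is available (.NET Core 3.0 or later); the class and method names (TranslatorDetect, BuildRequest, ParseLanguage) are made up for this example, and the key and region are placeholders you would take from your own Azure resource.

```csharp
using System;
using System.Net.Http;
using System.Text;
using System.Text.Json;

// Hypothetical helper, not part of any Microsoft SDK.
public static class TranslatorDetect
{
    // Global Translator endpoint for the v3 "detect" operation.
    private const string Endpoint =
        "https://api.cognitive.microsofttranslator.com/detect?api-version=3.0";

    // Builds the POST request; the body is a JSON array of { "Text": ... } objects.
    public static HttpRequestMessage BuildRequest(string text, string key, string region)
    {
        var body = JsonSerializer.Serialize(new[] { new { Text = text } });
        var request = new HttpRequestMessage(HttpMethod.Post, Endpoint)
        {
            Content = new StringContent(body, Encoding.UTF8, "application/json")
        };
        request.Headers.Add("Ocp-Apim-Subscription-Key", key);
        // The region header is needed for regional or multi-service resources.
        request.Headers.Add("Ocp-Apim-Subscription-Region", region);
        return request;
    }

    // Extracts the language code of the first result from the JSON response body.
    public static string ParseLanguage(string json)
    {
        using var doc = JsonDocument.Parse(json);
        return doc.RootElement[0].GetProperty("language").GetString();
    }
}
```

To send it, pass the request to HttpClient.SendAsync, read the response body as a string, and hand that string to ParseLanguage. Error handling and retries are left out of the sketch.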

Upvotes: 0

Ivan Akcheurov

Reputation: 2361

Fast answer: NTextCat (NuGet, Online Demo)

Long answer:

Currently the best way seems to be to use classifiers trained to assign a piece of text to one (or more) languages from a predefined set.

There is a Perl tool called TextCat. It has language models for the 74 most popular languages. There is a huge number of ports of this tool to different programming languages.

There was no port to .NET, so I wrote one: NTextCat on GitHub.

It is a pure .NET Framework DLL plus a command-line interface to it. By default, it uses a profile of 14 languages.

Any feedback is much appreciated! New ideas and feature requests are welcome too :)

An alternative is to use one of the numerous online services (e.g. the one from Google mentioned in another answer, detectlanguage.com, langid.net, etc.).

Upvotes: 27

Reg Edit

Reputation: 6916

You may use the C# package for language identification from Microsoft Research:

This package implements several algorithms for language identification, and includes two sets of pre-compiled language profiles. One set covers 52 languages and was trained on Wikipedia (i.e. a well-written corpus); the other covers 26 languages and was constructed from Twitter (i.e. a highly colloquial corpus). The language identifiers are packaged up as a C# library, and can be easily embedded into other C# projects.

Download the package from the above link.

Upvotes: 3

ariful islam

Reputation: 31

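We can use Regex.IsMatch(text, "[\\uxxxx-\\uxxxx]+") to detect a specific language by its script, where each xxxx is the four-digit hexadecimal Unicode code point of a character. Note that this identifies the script rather than the language: the Arabic block, for instance, is also used to write Persian and Urdu.
To detect Arabic:

bool isArabic = Regex.IsMatch(yourtext, @"[\u0600-\u06FF]+");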
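
The same idea extends to several scripts at once. The following is only a sketch: the ScriptDetector class is made up for illustration, the four Unicode blocks are an arbitrary selection, and the result is a script name, which narrows down the language but does not pin it.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper: picks the script with the most matching characters.
public static class ScriptDetector
{
    // Unicode blocks chosen for illustration; each identifies a script, not a language.
    private static readonly Dictionary<string, Regex> Scripts = new Dictionary<string, Regex>
    {
        ["Arabic"]   = new Regex(@"[\u0600-\u06FF]"),
        ["Cyrillic"] = new Regex(@"[\u0400-\u04FF]"),
        ["Hebrew"]   = new Regex(@"[\u0590-\u05FF]"),
        ["Greek"]    = new Regex(@"[\u0370-\u03FF]"),
    };

    // Returns the best-matching script name, or "Unknown" if no character matches.
    public static string DetectScript(string text)
    {
        var best = Scripts
            .Select(s => new { Name = s.Key, Count = s.Value.Matches(text).Count })
            .OrderByDescending(s => s.Count)
            .First();
        return best.Count > 0 ? best.Name : "Unknown";
    }
}
```

Counting matches instead of calling IsMatch means a mostly Cyrillic text that quotes a single Greek word is still reported as Cyrillic.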

Upvotes: 3

Magnus Johansson

Reputation: 28325

If your code runs with internet access, you can try the Google API for language detection. http://code.google.com/apis/ajaxlanguage/documentation/

var text = "¿Dónde está el baño?";
google.language.detect(text, function(result) {
  if (!result.error) {
    var language = 'unknown';
    for (var l in google.language.Languages) {
      if (google.language.Languages[l] == result.language) {
        language = l;
        break;
      }
    }
    var container = document.getElementById("detection");
    container.innerHTML = text + " is: " + language + "";
  }
});

And, since you are using C#, take a look at this article on how to call the API from C#.

UPDATE: That C# link is gone; here's a cached copy of the core of it:

string s = TextBoxTranslateEnglishToHebrew.Text;
string key = "YOUR GOOGLE AJAX API KEY";
GoogleLangaugeDetector detector =
   new GoogleLangaugeDetector(s, VERSION.ONE_POINT_ZERO, key);

GoogleTranslator gTranslator = new GoogleTranslator(s, VERSION.ONE_POINT_ZERO,
   detector.LanguageDetected.Equals("iw") ? LANGUAGE.HEBREW : LANGUAGE.ENGLISH,
   detector.LanguageDetected.Equals("iw") ? LANGUAGE.ENGLISH : LANGUAGE.HEBREW,
   key);

TextBoxTranslation.Text = gTranslator.Translation;

Basically, you need to create a URI and send it to Google that looks like:

http://ajax.googleapis.com/ajax/services/language/translate?v=1.0&q=hello%20world&langpair=en%7ciw&key=your_google_api_key_goes_here

This tells the API that you want to translate "hello world" from English to Hebrew. Google's JSON response would look like:

{"responseData": {"translatedText":"שלום העולם"}, "responseDetails": null, "responseStatus": 200}

I chose to make a base class that represents a typical Google JSON response:

[Serializable]
public class JSONResponse
{
   public string responseDetails = null;
   public string responseStatus = null;
}

Then, a Translation object that inherits from this class:

[Serializable]
public class Translation: JSONResponse
{
   public TranslationResponseData responseData = 
    new TranslationResponseData();
}

This Translation class has a TranslationResponseData object that looks like this:

[Serializable]
public class TranslationResponseData
{
   public string translatedText;
}

Finally, we can make the GoogleTranslator class:

using System;
using System.Collections.Generic;
using System.Text;

using System.Web;
using System.Net;
using System.IO;
using System.Runtime.Serialization.Json;

namespace GoogleTranslationAPI
{

   public class GoogleTranslator
   {
      private string _q = "";
      private string _v = "";
      private string _key = "";
      private string _langPair = "";
      private string _requestUrl = "";
      private string _translation = "";

      public GoogleTranslator(string queryTerm, VERSION version, LANGUAGE languageFrom,
         LANGUAGE languageTo, string key)
      {
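         // Escape the query term as a query-string value so that '&', '=' and similar characters are encoded too.
         _q = Uri.EscapeDataString(queryTerm);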
         _v = HttpUtility.UrlEncode(EnumStringUtil.GetStringValue(version));
         _langPair =
            HttpUtility.UrlEncode(EnumStringUtil.GetStringValue(languageFrom) +
            "|" + EnumStringUtil.GetStringValue(languageTo));
         _key = HttpUtility.UrlEncode(key);

         string encodedRequestUrlFragment =
            string.Format("?v={0}&q={1}&langpair={2}&key={3}",
            _v, _q, _langPair, _key);

         _requestUrl = EnumStringUtil.GetStringValue(BASEURL.TRANSLATE) + encodedRequestUrlFragment;

         GetTranslation();
      }

      public string Translation
      {
         get { return _translation; }
         private set { _translation = value; }
      }

      private void GetTranslation()
      {
         try
         {
            WebRequest request = WebRequest.Create(_requestUrl);

            // Dispose the response and the reader once the body has been read.
            using (WebResponse response = request.GetResponse())
            using (StreamReader reader = new StreamReader(response.GetResponseStream()))
            {
               // Read the whole body rather than only its first line.
               string json = reader.ReadToEnd();
               using (MemoryStream ms = new MemoryStream(Encoding.Unicode.GetBytes(json)))
               {
                  DataContractJsonSerializer ser =
                     new DataContractJsonSerializer(typeof(Translation));
                  Translation translation = ser.ReadObject(ms) as Translation;

                  _translation = translation.responseData.translatedText;
               }
            }
         }
         catch (Exception)
         {
            // Any failure is swallowed here, leaving Translation as an empty string.
         }
      }
   }
}

Upvotes: 33

GvS

Reputation: 52518

Perform a statistical analysis of the string: split it into words, get a dictionary for every language you want to test for, and then pick the language whose dictionary matches the most words.
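
A minimal sketch of that word-count approach follows. The WordListDetector class and its three tiny word lists are invented for illustration; a real detector would load full dictionaries or at least a few hundred common words per language.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: counts how many words of the text appear in each word list.
public static class WordListDetector
{
    // Tiny illustrative word lists, not real dictionaries.
    private static readonly Dictionary<string, HashSet<string>> WordLists =
        new Dictionary<string, HashSet<string>>
        {
            ["English"] = new HashSet<string> { "the", "and", "is", "of", "to", "where", "are" },
            ["Dutch"]   = new HashSet<string> { "de", "het", "en", "is", "van", "een", "waar", "zijn" },
            ["Spanish"] = new HashSet<string> { "el", "la", "y", "es", "de", "dónde", "está", "los" },
        };

    public static string Detect(string text)
    {
        // Replace every non-letter with a space, lower-case the rest, then split into words.
        var cleaned = new string(
            text.Select(c => char.IsLetter(c) ? char.ToLowerInvariant(c) : ' ').ToArray());
        var words = cleaned.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);

        // Pick the language with the most dictionary hits.
        var best = WordLists
            .Select(l => new { Language = l.Key, Hits = words.Count(w => l.Value.Contains(w)) })
            .OrderByDescending(l => l.Hits)
            .First();
        return best.Hits > 0 ? best.Language : "Unknown";
    }
}
```

Words shared between languages (such as "is" in English and Dutch) count for both, so the method only becomes reliable once the text contains several distinctive words.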

In C#, every string in memory is Unicode, so it carries no encoding that could hint at the language. Text files do not store their encoding either (sometimes there is only an indication of 8-bit or 16-bit).

If you only need to distinguish between two languages, you might find some simple tricks. For example, to tell English from Dutch: a string that contains the letter "y" is most likely English (unreliable but fast).

Upvotes: 6

Greg Hewgill

Reputation: 992707

A statistical approach using digraphs or trigraphs is a very good indicator. For example, here are the most common digraphs in English in order: http://www.letterfrequency.org/#digraph-frequency (one can find better or more complete lists). This method may have a better success rate than word analysis for short snippets of text because there are more digraphs in text than there are complete words.
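
As a rough sketch of the idea, the code below builds a trigram count profile for the input and for one reference sample per language, then picks the language with the highest cosine similarity. The TrigramDetector class is made up for this example, cosine similarity is just one reasonable way to compare profiles, and the reference samples are supplied by the caller; in practice they should be large corpora, not single sentences.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: compares character-trigram profiles.
public static class TrigramDetector
{
    // Counts every overlapping three-character sequence in the lower-cased text.
    public static Dictionary<string, int> Profile(string text)
    {
        var counts = new Dictionary<string, int>();
        var s = text.ToLowerInvariant();
        for (int i = 0; i + 3 <= s.Length; i++)
        {
            var gram = s.Substring(i, 3);
            counts[gram] = counts.TryGetValue(gram, out var n) ? n + 1 : 1;
        }
        return counts;
    }

    // Cosine similarity of two count vectors: 0 means no shared trigram.
    public static double Similarity(Dictionary<string, int> a, Dictionary<string, int> b)
    {
        double dot = a.Sum(kv => b.TryGetValue(kv.Key, out var n) ? (double)kv.Value * n : 0.0);
        double normA = Math.Sqrt(a.Values.Sum(v => (double)v * v));
        double normB = Math.Sqrt(b.Values.Sum(v => (double)v * v));
        return normA == 0 || normB == 0 ? 0.0 : dot / (normA * normB);
    }

    // Returns the language whose reference sample is most similar to the text.
    public static string Detect(string text, IDictionary<string, string> samples)
    {
        var profile = Profile(text);
        return samples
            .OrderByDescending(s => Similarity(profile, Profile(s.Value)))
            .First().Key;
    }
}
```

Switching from trigrams to digraphs only means changing the window length of 3 in Profile to 2.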

Upvotes: 8

AakashM

Reputation: 63340

If you mean the natural (i.e. human) language, this is in general a Hard Problem. What language is "server": English or Turkish? What language is "chat": English or French? What language is "uno": Italian or Spanish (or Latin!)?

Without paying attention to context and doing some hard natural language processing (that is the phrase to google for), you haven't got a chance.

You might enjoy a look at Frengly - it's a nice UI onto the Google Translate service which attempts to guess the language of the input text...

Upvotes: 6
