empi

Reputation: 15901

Removing diacritics in Polish

I'm trying to remove diacritic characters from a Polish pangram. I'm using the code from Michael Kaplan's blog (http://www.siao2.com/2007/05/14/2629747.aspx), but with no success.

Consider the following pangram: "Pchnąć w tę łódź jeża lub ośm skrzyń fig.". Everything works fine except for the letter "ł", which still comes out as "ł". I guess the problem is that "ł" is represented as a single Unicode character, so there is no following NonSpacingMark to remove.

Do you have any idea how I can fix this without relying on a custom mapping in some dictionary? I'm looking for some kind of Unicode conversion.

Upvotes: 9

Views: 14635

Answers (9)

MKasprzyk

Reputation: 499

public static string ReplacePolishSigns(this string input) 
        => input.Replace("ą", "a")
            .Replace("ć", "c")
            .Replace("ę", "e")
            .Replace("ł", "l")
            .Replace("ń", "n")
            .Replace("ó", "o")
            .Replace("ś", "s")
            .Replace("ż", "z")
            .Replace("ź", "z");    

Upvotes: 1

A.Herman

Reputation: 1

I propose the following. It works perfectly, but note that it also lowercases the text.

private static Dictionary<string, string> NormalizeTable()
{
    return new Dictionary<string, string>()
    {
        {"ą", "a"},
        {"ć", "c"},
        {"ę", "e"},
        {"ł", "l"},
        {"ń", "n"},
        {"ó", "o"},
        {"ś", "s"},
        {"ź", "z"},
        {"ż", "z"},
    };
}

public static string Normalize(string original)
{
    if (original == null) return null;
    // Lowercase culture-independently; the table only covers lowercase letters.
    var lower = original.ToLowerInvariant();
    var dictionary = NormalizeTable();
    foreach (var (key, value) in dictionary)
    {
        lower = lower.Replace(key, value);
    }
    return lower;
}

Upvotes: 0

ahaw

Reputation: 221

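I found a solution based on Unicode normalization. Normalization alone leaves 'ł' untouched because it has no decomposition, so 'ł' and 'Ł' are replaced explicitly at the end:

// Requires: using System.Globalization; using System.Text;
string RemoveDiacritics(string text)
{
    // Decompose each accented letter into a base letter plus combining marks.
    var normalizedString = text.Normalize(NormalizationForm.FormD);
    var stringBuilder = new StringBuilder();

    foreach (var c in normalizedString)
    {
        // Keep everything except the combining marks.
        var unicodeCategory = CharUnicodeInfo.GetUnicodeCategory(c);
        if (unicodeCategory != UnicodeCategory.NonSpacingMark)
        {
            stringBuilder.Append(c);
        }
    }

    // 'ł' and 'Ł' have no canonical decomposition, so map them by hand.
    return stringBuilder.ToString()
        .Normalize(NormalizationForm.FormC)
        .Replace('ł', 'l')
        .Replace('Ł', 'L');
}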

Upvotes: -1

sinnerinc

Reputation: 342

Some time ago I came across this solution, which seems to work fine:

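    // Requires: using System.Text;
    // On .NET Core / .NET 5+ the "Cyrillic" code page is not available by default;
    // it likely needs Encoding.RegisterProvider(CodePagesEncodingProvider.Instance) first.
    public static string RemoveDiacritics(this string s)
    {
        // Encode to a code page that lacks the accented Latin letters,
        // then read the resulting bytes back as ASCII.
        string asciiEquivalents = Encoding.ASCII.GetString(
                     Encoding.GetEncoding("Cyrillic").GetBytes(s)
                 );

        return asciiEquivalents;
    }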

Upvotes: 8

Michal_R

Reputation: 269

Here is my quick Java implementation of a Polish stop list that normalizes Polish diacritics.

class StopList
{
    private HashSet<String> set = new HashSet<String>();

    public void add(String word)
    {
        word = word.trim().toLowerCase();
        word = normalize(word);
        set.add(word);

    }

    public boolean contains(final String string)
    {
        return set.contains(string) || set.contains(normalize(string));
    }

    private char normalizeChar(final char c)
    {
        switch ( c)
        {
            case 'ą':
                return 'a';
            case 'ć':
                return 'c';
            case 'ę':
                return 'e';
            case 'ł':
                return 'l';
            case 'ń':
                return 'n';
            case 'ó':
                return 'o';
            case 'ś':
                return 's';
            case 'ż':
            case 'ź':
                return 'z';
        }
        return c;
    }

    private String normalize(final String word)
    {
        if (word == null || "".equals(word))
        {
            return word;
        }
        char[] charArray = word.toCharArray();
        char[] normalizedArray = new char[charArray.length];
        for (int i = 0; i < normalizedArray.length; i++)
        {
            normalizedArray[i] = normalizeChar(charArray[i]);
        }
        return new String(normalizedArray);
    }
}

I couldn't find any other solution on the Net, so maybe this will be helpful to someone.

Upvotes: 4

dan04

Reputation: 91189

You'll have to replace these manually (just like with ÆÐØÞßæðøþ in Latin-1).

Other people have had the same problem, so the Unicode Common Locale Data Repository has "Agreed to add a transliterator that does accent removal, even for overlaid accents." (Ticket #2884)
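A minimal sketch of that manual replacement in C#, combined with the usual mark-stripping pass. The `Transliterator` class, the `Manual` table and the ASCII spellings chosen in it (for example "ss" for ß and "th" for þ) are my own assumptions, not a standard mapping:

```csharp
using System.Collections.Generic;
using System.Globalization;
using System.Text;

static class Transliterator
{
    // Letters with no canonical decomposition need explicit entries.
    // Hypothetical table: the spellings are illustrative, extend as needed.
    private static readonly Dictionary<char, string> Manual = new Dictionary<char, string>
    {
        ['ł'] = "l",  ['Ł'] = "L",
        ['Æ'] = "AE", ['æ'] = "ae",
        ['Ð'] = "D",  ['ð'] = "d",
        ['Ø'] = "O",  ['ø'] = "o",
        ['Þ'] = "TH", ['þ'] = "th",
        ['ß'] = "ss",
    };

    public static string RemoveDiacritics(string text)
    {
        if (string.IsNullOrEmpty(text)) return text;

        var sb = new StringBuilder(text.Length);
        // FormD splits letters such as "ą" into a base letter plus a combining mark.
        foreach (char c in text.Normalize(NormalizationForm.FormD))
        {
            // Drop the combining marks themselves.
            if (CharUnicodeInfo.GetUnicodeCategory(c) == UnicodeCategory.NonSpacingMark)
                continue;

            // Everything that survives is either in the manual table or kept as-is.
            if (Manual.TryGetValue(c, out string replacement))
                sb.Append(replacement);
            else
                sb.Append(c);
        }
        return sb.ToString().Normalize(NormalizationForm.FormC);
    }
}
```

Called on the pangram from the question, `Transliterator.RemoveDiacritics` should return "Pchnac w te lodz jeza lub osm skrzyn fig."; the table only matters for "ł", the other letters are handled by the normalization pass.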

Upvotes: 2

Jon Hanna

Reputation: 113322

There are quite a few precomposed characters that have no meaningful decompositions.

(There are also a handful that could have reasonable decompositions but are prohibited from such decomposition in most normalisation forms, because it would lead to differences between Unicode versions, which would make them not really normalisation any more.)

ł is one of these. IIRC it's also not possible to give a culture-neutral transcription into alphabets that don't use ł. I think Germans tend to transcribe it as w rather than l (or maybe it's someone else who does), which makes sense (it's not quite the right sound either, but it's closer than l).

Upvotes: 1

Hans Passant

Reputation: 942020

It is in the Unicode chart, code point \u0142. Scroll down to the description, "Latin small letter L with stroke"; it has no decomposition listed. I don't know anything about Polish, but it is common for a letter to carry a distinguishing mark that makes it a letter in its own right rather than a base letter with a diacritic.
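The missing decomposition can also be checked from code. A minimal sketch, assuming .NET's `string.Normalize` with `NormalizationForm.FormD`; the class and method names (`DecompositionDemo`, `DecomposedCodeUnits`) are made up for illustration:

```csharp
using System.Linq;
using System.Text;

static class DecompositionDemo
{
    // Lists the UTF-16 code units of the canonically decomposed (NFD) form
    // of the input, formatted as "U+XXXX" and separated by spaces.
    public static string DecomposedCodeUnits(string s) =>
        string.Join(" ", s.Normalize(NormalizationForm.FormD)
                          .Select(c => $"U+{(int)c:X4}"));
}
```

With this helper, "ą" should come back as two code units (U+0061 followed by the combining ogonek U+0328), while "ł" stays as the single U+0142, so there is no separate mark to strip.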

Upvotes: 2

Eric J.

Reputation: 150148

The approach taken in the article is to remove "Mark, Nonspacing" characters. Since, as you correctly point out, "ł" is not composed of two characters (one of which is a Mark, Nonspacing), the behavior you see is expected.

I don't think that the structure of Unicode allows you to accomplish a fully automated remapping (the author of the article you reference reaches the same conclusion).

If you're just interested in Polish characters, at least the mapping is small and well-defined (see e.g. the bottom of http://www.biega.com/special-char.html). For the general case, I do not think an automated solution exists for characters that are not composed of a standard character plus a Mark, Nonspacing character.

Upvotes: 3
