Reputation: 6504

How to perform a search on Arabic text in Java?

I have Arabic text with diacritics in a database. When I type an Arabic search string, it has no diacritics, so it does not match the database string. The search works fine on text without diacritics. Is there any way to make it work on text with diacritics?

Upvotes: 11

Answers (6)

Arshad

Reputation: 965

Please see the class below that I created. It is for Android and returns a SpannableString. It is very basic and I did not worry about memory consumption; you can optimise it yourself.

~~http://freshinfresh.com/sample/ABHArabicDiacritics.java~~

https://gist.github.com/alierdogan7/11f9cfb24f5551c34191485fc764d4c0

If you want to check whether an Arabic String contains a word while ignoring the diacritics (harakat):

ABHArabicDiacritics objSearch = new ABHArabicDiacritics();
boolean found = objSearch.getDiacriticinsensitive("وَ اَشْهَدُ اَنْ لا اِلهَ اِلاَّ اللَّهُ").contains("اشهد");

If you want to get the String back with the searched portion highlighted (colored red), use the code below:

ABHArabicDiacritics objSearch = new ABHArabicDiacritics("وَ اَشْهَدُ اَنْ لا اِلهَ اِلاَّ اللَّهُ", "اشهد");
SpannableString spoutput = objSearch.getSearchHighlightedSpan();
textView.setText(spoutput);

To get the start and end position of the search text, use the methods below:

/** check whether the text contains the search word */
objSearch.isContain();
objSearch.getSearchHighlightedSpan();
objSearch.getSearchTextStartPosition();
objSearch.getSearchTextEndPosition();

Please copy the shared Java class and enjoy.

I will spend more time on further features if you request them.

Thanks

Upvotes: -1

Zain

Reputation: 40830

Hope I'm not too late to the party. My issue is a little different from the OP's: I wanted to search Arabic text that has diacritics and mark the matched text with a color, so I need the indices of the match in the original text.

The problem is that stripping the diacritics shortens the text, so the indices found in the stripped text differ from the indices in the original text.
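The index shift is easy to reproduce in plain Java. The following is a minimal sketch, not part of the original answer: the class name `IndexShiftDemo` and the helper `stripMarks` are made up for illustration, and it assumes that removing every Unicode combining mark (`\p{M}`) is an acceptable stand-in for removing diacritics. The sample text is the beginning of "وَ اَشْهَدُ" and the search word is "اشهد", both written with Unicode escapes.

```java
final class IndexShiftDemo {

    // Hypothetical helper: removes all combining marks, which includes Arabic tashkeel.
    static String stripMarks(String s) {
        return s.replaceAll("\\p{M}", "");
    }

    public static void main(String[] args) {
        // Two words with diacritics: waw + fatha, space, alef + fatha, sheen + sukun, heh + fatha, dal + damma
        String text = "\u0648\u064E \u0627\u064E\u0634\u0652\u0647\u064E\u062F\u064F";
        // The second word typed without diacritics: alef, sheen, heh, dal
        String word = "\u0627\u0634\u0647\u062F";

        String stripped = stripMarks(text);
        System.out.println("original length: " + text.length());
        System.out.println("stripped length: " + stripped.length());
        System.out.println("index in stripped text: " + stripped.indexOf(word));
        System.out.println("index in original text: " + text.indexOf(word));
    }
}
```

In the stripped text the word is found at index 2, but in the original text it actually starts at index 3, and a plain `indexOf` on the original finds nothing at all. Any span computed on the stripped text would therefore color the wrong characters.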

I solved that with a regex and a SpannableString:

import android.text.Spannable;
import android.text.SpannableString;
import android.text.style.ForegroundColorSpan;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/*
 * input: text containing Arabic diacritics or letter variants that should be ignored while searching
 * searchedWord: the word/text to look for in the input text
 * color: foreground color applied to every match in the returned SpannableString
 */
public static Spannable searchArabicWithIgnoredDiacriticsOrLetters(String input, String searchedWord, int color) {

    // replaceLetters() swaps characters one for one, so indices in the
    // normalized text are the same as indices in the original input.
    String normalized = replaceLetters(input);
    Spannable output = new SpannableString(normalized);
    StringBuilder sb = new StringBuilder();
    for (char ch : replaceLetters(searchedWord).toCharArray()) {
        sb.append(Pattern.quote(String.valueOf(ch))); // treat each searched character literally
        // allow any number of diacritics after each letter
        sb.append("[\\u0655\\u0654\\u0670\\u065F\\u065E\\u065D\\u065C\\u065B\\u065A\\u0659\\u0658\\u0657\\u0656\\u06EC\\u06EB\\u06EA\\u06E4\\u061A\\u0619\\u0618\\u0617\\u0616\\u0615\\u064B\\u064C\\u064D\\u064E\\u064F\\u0650\\u0651\\u0652\\u0653\\u06DA\\u06D6\\u06D7\\u06D8\\u06D9\\u06DB\\u06DC\\u06DF\\u06E0\\u06E1\\u06E2\\u06E3\\u06E5\\u06E6\\u06E7\\u06E8\\u06EB\\u06EC\\u06ED]*");
    }

    Pattern pattern = Pattern.compile(sb.toString());
    // Match against the normalized text, because the pattern is built from normalized letters
    Matcher matcher = pattern.matcher(normalized);
    while (matcher.find()) {
        output.setSpan(new ForegroundColorSpan(color),
                matcher.start(), matcher.end(), Spannable.SPAN_EXCLUSIVE_EXCLUSIVE);
    }
    return output;
}

public static String replaceLetters(String input) {
    String output;
    output = input.replaceAll("أ", "ا");
    output = output.replaceAll("إ", "ا");
    output = output.replaceAll("ى", "ي");
    output = output.replaceAll("ة", "ه");
    output = output.replaceAll("آ", "ا");
    output = output.replaceAll("ٱ", "ا");
    return output;
}

Another representation of replaceLetters()

public static String replaceLetters(String input) {
    String output;

    output = input.replaceAll("\\u0623", String.valueOf((char) Integer.parseInt("0627", 16)));  // replace أ with ا
    output = output.replaceAll("\\u0625", String.valueOf((char) Integer.parseInt("0627", 16))); // replace إ with ا
    output = output.replaceAll("\\u0649", String.valueOf((char) Integer.parseInt("064A", 16))); // replace ى with ي
    output = output.replaceAll("\\u0629", String.valueOf((char) Integer.parseInt("0647", 16))); // replace ة with ه
    output = output.replaceAll("\\u0622", String.valueOf((char) Integer.parseInt("0627", 16))); // replace آ with ا
    output = output.replaceAll("\\u0671", String.valueOf((char) Integer.parseInt("0627", 16))); // replace ٱ with ا

    return output;
}

Note: you can refer to the accepted answer for the Unicode representation.

Upvotes: 0

Moro

Reputation: 2128

import java.util.regex.Matcher;
import java.util.regex.Pattern;

String targetWord = "الذين";
String text = "صِرَاطَ الَّذِينَ أَنْعَمْتَ عَلَيْهِمْ غَيْرِ الْمَغْضُوبِ عَلَيْهِمْ وَلَا الضَّالِّين";

// After every letter of the target word, allow any number of combining marks (diacritics)
StringBuilder regex = new StringBuilder();
for (char c : targetWord.toCharArray()) {
   regex.append(Pattern.quote(String.valueOf(c)));
   regex.append("\\p{M}*");
}

Pattern searchRegEx = Pattern.compile(regex.toString());
Matcher m = searchRegEx.matcher(text);

if (m.find()) {
   int i = m.start(); // index of the match in the original text with diacritics
   System.out.println("m.group(0):: " + i + " : " + m.group(0));
}

Upvotes: 0

n-oma-d

Reputation: 95

I found a much better way to do that. All credit to joop for this:

import java.text.Normalizer;
import java.text.Normalizer.Form;

/**
 *
 * @author Ibbtek <http://ibbtek.altervista.org/>
 */
public class ArabicDiacritics {

    private String input;
    private final String output;

    /**
     * ArabicDiacritics constructor
     * @param input String
     */
    public ArabicDiacritics(String input){
        this.input=input;
        this.output=normalize();
    }

    /**
     * normalize Method
     * @return String
     */
    private String normalize(){

        input = Normalizer.normalize(input, Form.NFKD)
                .replaceAll("\\p{M}", "");

        return input;
    }

    /**
     * @return the output
     */
    public String getOutput() {
        return output;
    }

    public static void main(String[] args) {
        String test = "كَلَّا لَا تُطِعْهُ وَاسْجُدْ وَاقْتَرِبْ ۩";
        System.out.println("Before: "+test);
        test=new ArabicDiacritics(test).getOutput();
        System.out.println("After: "+test);
    }
}

Upvotes: 5

Ibrabel

Reputation: 394

is there any way to run it on text with diacritics ???

Unfortunately no. Like MIE said:

Arabic diacritics are characters

so it's not really possible as far as I know.

MIE's answer will be difficult to implement, and practically impossible to keep up to date if you change anything in your database.

You could look at the Apache Lucene search library. I'm not sure, but it looks like it can solve your problem.

Otherwise you'll need to strip all the diacritics from the text in your database. Then you can query it with or without diacritics simply by running the query through a small Arabic normalizer like this one:

/**
 * ArabicNormalizer class
 * @author Ibrabel
 */
public final class ArabicNormalizer {

    private String input;
    private final String output;

    /**
     * ArabicNormalizer constructor
     * @param input String
     */
    public ArabicNormalizer(String input){
        this.input=input;
        this.output=normalize();
    }

    /**
     * normalize Method
     * @return String
     */
    private String normalize(){

        //Remove honorific sign
        input=input.replaceAll("\u0610", "");//ARABIC SIGN SALLALLAHOU ALAYHE WA SALLAM
        input=input.replaceAll("\u0611", "");//ARABIC SIGN ALAYHE ASSALLAM
        input=input.replaceAll("\u0612", "");//ARABIC SIGN RAHMATULLAH ALAYHE
        input=input.replaceAll("\u0613", "");//ARABIC SIGN RADI ALLAHOU ANHU
        input=input.replaceAll("\u0614", "");//ARABIC SIGN TAKHALLUS

        //Remove Koranic annotation
        input=input.replaceAll("\u0615", "");//ARABIC SMALL HIGH TAH
        input=input.replaceAll("\u0616", "");//ARABIC SMALL HIGH LIGATURE ALEF WITH LAM WITH YEH
        input=input.replaceAll("\u0617", "");//ARABIC SMALL HIGH ZAIN
        input=input.replaceAll("\u0618", "");//ARABIC SMALL FATHA
        input=input.replaceAll("\u0619", "");//ARABIC SMALL DAMMA
        input=input.replaceAll("\u061A", "");//ARABIC SMALL KASRA
        input=input.replaceAll("\u06D6", "");//ARABIC SMALL HIGH LIGATURE SAD WITH LAM WITH ALEF MAKSURA
        input=input.replaceAll("\u06D7", "");//ARABIC SMALL HIGH LIGATURE QAF WITH LAM WITH ALEF MAKSURA
        input=input.replaceAll("\u06D8", "");//ARABIC SMALL HIGH MEEM INITIAL FORM
        input=input.replaceAll("\u06D9", "");//ARABIC SMALL HIGH LAM ALEF
        input=input.replaceAll("\u06DA", "");//ARABIC SMALL HIGH JEEM
        input=input.replaceAll("\u06DB", "");//ARABIC SMALL HIGH THREE DOTS
        input=input.replaceAll("\u06DC", "");//ARABIC SMALL HIGH SEEN
        input=input.replaceAll("\u06DD", "");//ARABIC END OF AYAH
        input=input.replaceAll("\u06DE", "");//ARABIC START OF RUB EL HIZB
        input=input.replaceAll("\u06DF", "");//ARABIC SMALL HIGH ROUNDED ZERO
        input=input.replaceAll("\u06E0", "");//ARABIC SMALL HIGH UPRIGHT RECTANGULAR ZERO
        input=input.replaceAll("\u06E1", "");//ARABIC SMALL HIGH DOTLESS HEAD OF KHAH
        input=input.replaceAll("\u06E2", "");//ARABIC SMALL HIGH MEEM ISOLATED FORM
        input=input.replaceAll("\u06E3", "");//ARABIC SMALL LOW SEEN
        input=input.replaceAll("\u06E4", "");//ARABIC SMALL HIGH MADDA
        input=input.replaceAll("\u06E5", "");//ARABIC SMALL WAW
        input=input.replaceAll("\u06E6", "");//ARABIC SMALL YEH
        input=input.replaceAll("\u06E7", "");//ARABIC SMALL HIGH YEH
        input=input.replaceAll("\u06E8", "");//ARABIC SMALL HIGH NOON
        input=input.replaceAll("\u06E9", "");//ARABIC PLACE OF SAJDAH
        input=input.replaceAll("\u06EA", "");//ARABIC EMPTY CENTRE LOW STOP
        input=input.replaceAll("\u06EB", "");//ARABIC EMPTY CENTRE HIGH STOP
        input=input.replaceAll("\u06EC", "");//ARABIC ROUNDED HIGH STOP WITH FILLED CENTRE
        input=input.replaceAll("\u06ED", "");//ARABIC SMALL LOW MEEM

        //Remove tatweel
        input=input.replaceAll("\u0640", "");

        //Remove tashkeel
        input=input.replaceAll("\u064B", "");//ARABIC FATHATAN
        input=input.replaceAll("\u064C", "");//ARABIC DAMMATAN
        input=input.replaceAll("\u064D", "");//ARABIC KASRATAN
        input=input.replaceAll("\u064E", "");//ARABIC FATHA
        input=input.replaceAll("\u064F", "");//ARABIC DAMMA
        input=input.replaceAll("\u0650", "");//ARABIC KASRA
        input=input.replaceAll("\u0651", "");//ARABIC SHADDA
        input=input.replaceAll("\u0652", "");//ARABIC SUKUN
        input=input.replaceAll("\u0653", "");//ARABIC MADDAH ABOVE
        input=input.replaceAll("\u0654", "");//ARABIC HAMZA ABOVE
        input=input.replaceAll("\u0655", "");//ARABIC HAMZA BELOW
        input=input.replaceAll("\u0656", "");//ARABIC SUBSCRIPT ALEF
        input=input.replaceAll("\u0657", "");//ARABIC INVERTED DAMMA
        input=input.replaceAll("\u0658", "");//ARABIC MARK NOON GHUNNA
        input=input.replaceAll("\u0659", "");//ARABIC ZWARAKAY
        input=input.replaceAll("\u065A", "");//ARABIC VOWEL SIGN SMALL V ABOVE
        input=input.replaceAll("\u065B", "");//ARABIC VOWEL SIGN INVERTED SMALL V ABOVE
        input=input.replaceAll("\u065C", "");//ARABIC VOWEL SIGN DOT BELOW
        input=input.replaceAll("\u065D", "");//ARABIC REVERSED DAMMA
        input=input.replaceAll("\u065E", "");//ARABIC FATHA WITH TWO DOTS
        input=input.replaceAll("\u065F", "");//ARABIC WAVY HAMZA BELOW
        input=input.replaceAll("\u0670", "");//ARABIC LETTER SUPERSCRIPT ALEF

        //Replace Waw Hamza Above by Waw
        input=input.replaceAll("\u0624", "\u0648");

        //Replace Ta Marbuta by Ha
        input=input.replaceAll("\u0629", "\u0647");

        //Replace Ya
        // and Ya Hamza Above by Alif Maksura
        input=input.replaceAll("\u064A", "\u0649");
        input=input.replaceAll("\u0626", "\u0649");

        // Replace Alifs with Hamza Above/Below
        // and with Madda Above by Alif
        input=input.replaceAll("\u0622", "\u0627");
        input=input.replaceAll("\u0623", "\u0627");
        input=input.replaceAll("\u0625", "\u0627");

        return input;
    }

    /**
     * @return the output
     */
    public String getOutput() {
        return output;
    }

    public static void main(String[] args) {
        String test = "كَلَّا لَا تُطِعْهُ وَاسْجُدْ وَاقْتَرِبْ ۩";
        System.out.println("Before: "+test);
        test=new ArabicNormalizer(test).getOutput();
        System.out.println("After: "+test);
    }
}
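To actually search with such a normalizer, both the stored text and the typed query have to go through it before they are compared. The following is a minimal sketch of that step, assuming the same set of code points as the class above; `NormalizedSearch`, `normalize` and `containsIgnoringDiacritics` are hypothetical names, and the long list of single replacements is written as character ranges. The example text in `main` is "وَ اَشْهَدُ" and the query is "اشهد", both given as Unicode escapes.

```java
import java.util.regex.Pattern;

final class NormalizedSearch {

    // The code points the class above removes one by one, written as ranges:
    // honorific signs and Koranic annotation (U+0610..U+061A, U+06D6..U+06ED),
    // tatweel (U+0640), tashkeel (U+064B..U+065F) and superscript alef (U+0670).
    private static final Pattern REMOVED = Pattern.compile(
            "[\\u0610-\\u061A\\u0640\\u064B-\\u065F\\u0670\\u06D6-\\u06ED]");

    static String normalize(String s) {
        return REMOVED.matcher(s).replaceAll("")
                .replace('\u0624', '\u0648')  // waw with hamza above -> waw
                .replace('\u0629', '\u0647')  // ta marbuta -> ha
                .replace('\u064A', '\u0649')  // ya -> alif maksura
                .replace('\u0626', '\u0649')  // ya with hamza above -> alif maksura
                .replace('\u0622', '\u0627')  // alif with madda above -> alif
                .replace('\u0623', '\u0627')  // alif with hamza above -> alif
                .replace('\u0625', '\u0627'); // alif with hamza below -> alif
    }

    // Normalizes both sides, so a query typed without diacritics
    // can match stored text that still has them.
    static boolean containsIgnoringDiacritics(String stored, String query) {
        return normalize(stored).contains(normalize(query));
    }

    public static void main(String[] args) {
        String stored = "\u0648\u064E \u0627\u064E\u0634\u0652\u0647\u064E\u062F\u064F";
        String query = "\u0627\u0634\u0647\u062F";
        System.out.println(containsIgnoringDiacritics(stored, query));
    }
}
```

If the database already holds a normalized copy of each string, only the query needs to be normalized at search time.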

Upvotes: 16

MIE

Reputation: 454

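Arabic diacritics are ordinary characters, so you can match them with a pattern. Plain LIKE has no `*` repetition operator, so this needs a regular-expression match, for example REGEXP in MySQL:

SELECT * FROM table WHERE name REGEXP 'a[cd]*b[cd]*'

This finds 'a' followed by 'b', with any number of c or d after each letter.

You can do the same for Arabic by putting all Arabic diacritics between square brackets after every letter of the search word.

Here you can find all of them with their Unicode code points.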
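Writing such a pattern by hand for every search word is tedious, so it can be generated. The following Java sketch builds the pattern string; `DiacriticPattern`, `DIACRITICS` and `build` are hypothetical names, the character class covers only the common tashkeel plus superscript alef (an assumption, extend it as needed), and the result is checked here with `java.util.regex` only. Whether a given database accepts the same pattern depends on its regular-expression dialect.

```java
import java.util.regex.Pattern;

final class DiacriticPattern {

    // Assumption: only tashkeel (U+064B..U+0652) and superscript alef (U+0670)
    // need to be tolerated. The escapes are resolved by the Java compiler, so the
    // string holds the actual characters rather than backslash sequences.
    static final String DIACRITICS = "[\u064B-\u0652\u0670]*";

    // Appends the diacritics class after every character of the word.
    // Assumes the word holds only letters, with no regex metacharacters.
    static String build(String word) {
        StringBuilder sb = new StringBuilder();
        for (char c : word.toCharArray()) {
            sb.append(c).append(DIACRITICS);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Search word without diacritics: alef, sheen, heh, dal
        String pattern = build("\u0627\u0634\u0647\u062F");
        // Stored text with diacritics
        String text = "\u0648\u064E \u0627\u064E\u0634\u0652\u0647\u064E\u062F\u064F";
        System.out.println(Pattern.compile(pattern).matcher(text).find());
    }
}
```

The generated string would then be passed to the query as a bound parameter rather than concatenated into the SQL text.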

Upvotes: 3
