Amr Lotfy

Reputation: 2997

Solr Indexing of Arabic content (with diacritics)

Each document has three fields: two are integers and the third is Arabic text with diacritics. Users may search with fully diacritized words, with no diacritics at all, or with diacritics on only some of the letters. I can't find a schema.xml that handles this situation.

My schema.xml is currently as follows:

<?xml version="1.0" encoding="UTF-8" ?>

<schema name="quran" version="1.5">

<fields>
    <field name="_version_" type="long" indexed="true" stored="true"/>
    <field name="_id" type="long" indexed="true" stored="true" />
    <field name="sura_number" type="int" indexed="true" stored="true" />
    <field name="verse_number" type="int" indexed="true" stored="true" />
    <field name="verse_text" type="text_ar" indexed="true" stored="true"/>
 </fields>



<types>
    <fieldType name="string" class="solr.StrField" sortMissingLast="true" />
    <fieldType name="boolean" class="solr.BoolField" sortMissingLast="true"/>

    <fieldType name="int" class="solr.TrieIntField" precisionStep="0" positionIncrementGap="0"/>
    <fieldType name="float" class="solr.TrieFloatField" precisionStep="0" positionIncrementGap="0"/>
    <fieldType name="long" class="solr.TrieLongField" precisionStep="0" positionIncrementGap="0"/>
    <fieldType name="double" class="solr.TrieDoubleField" precisionStep="0" positionIncrementGap="0"/>

    <fieldType name="date" class="solr.TrieDateField" precisionStep="0" positionIncrementGap="0"/>

    <fieldType name="location" class="solr.LatLonType" subFieldSuffix="_coordinate"/>
    <dynamicField name="*_coordinate"  type="double" indexed="true"  stored="false"/>

    <!--  Arabic  -->
    <fieldType name="text_ar" class="solr.TextField" positionIncrementGap="100">
        <analyzer>
            <tokenizer class="solr.StandardTokenizerFactory"/>
            <!--  normalizes ﻯ to ﻱ, etc  -->
            <filter class="solr.ArabicNormalizationFilterFactory"/>
            <filter class="solr.ArabicStemFilterFactory"/>
        </analyzer>
    </fieldType>


</types>

</schema>

I also need a synonyms.txt for Arabic.

Upvotes: 0

Views: 446

Answers (2)

Ramzi Alqrainy

Reputation: 21

What do you think about using the schema.xml configuration shown on slide 18 of the presentation below?

Arabic Content with Apache Solr

Upvotes: 2

Alexandre Rafalovitch

Reputation: 9789

You want to use ICUTransformFilterFactory. It's a little hard to understand, but if you follow the link to the filter itself and then on to the ICU User Guide, you will find a lot of information.

Some of that material is quite dense, so you may find the example I built for the Thai language useful as a starting point.
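
As a rough starting point for the Arabic case, a field type along the following lines might work. This is only a sketch and has not been tested against this index. It assumes the analysis-extras contrib jars (the ICU analyzers and ICU4J) are on Solr's classpath. The field type name `text_ar_icu` is made up for illustration, and the transform ID is the common ICU idiom for decomposing text and dropping nonspacing marks, which include the Arabic harakat.

```xml
<!-- Hypothetical field type; the name text_ar_icu is an assumption, not part of the original schema -->
<fieldType name="text_ar_icu" class="solr.TextField" positionIncrementGap="100">
  <!-- A single <analyzer> is applied at both index time and query time -->
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Decompose (NFD), remove all nonspacing marks, then recompose (NFC) -->
    <filter class="solr.ICUTransformFilterFactory" id="NFD; [:Nonspacing Mark:] Remove; NFC"/>
    <!-- Keep the existing letter normalization from the question's text_ar type -->
    <filter class="solr.ArabicNormalizationFilterFactory"/>
  </analyzer>
</fieldType>
```

Because the same chain runs on both the indexed text and the query, a query with full, partial, or no diacritics should reduce to the same tokens as the stored verse. To try it, `verse_text` would need to point at this type and the documents would need to be reindexed. One side effect to be aware of: NFD also splits letters such as أ and ؤ into a base letter plus a combining hamza, which this transform then removes.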

Upvotes: 0
