mritz_p

Reputation: 3098

SOLR 4.0 alphabetical sorting trouble

I'm having a hard time getting my head around an issue with my SOLR address database.

I built this one up from the example files. I'm basically running the example configuration with a modified schema.

schema.xml:

<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
<field name="_version_" type="long" indexed="true" stored="true" required="false" multiValued="false" />

<field name="givenname_s" type="text_de" indexed="true" stored="true" required="true" multiValued="false" />
<field name="middleinitial_s" type="text_de" indexed="false" stored="true" required="false" multiValued="false" />
<field name="surname_s" type="text_de" indexed="true" stored="true" required="true" multiValued="false" />
<field name="gender_s" type="string" indexed="true" stored="true" required="true" multiValued="false" />
<field name="pictureuri_s" type="string" indexed="false" stored="true" required="false" multiValued="false" />
<field name="function_s" type="text_de" indexed="true" stored="true" required="false" multiValued="false" />
<field name="organizationalunit_s" type="text_general" indexed="true" stored="true" required="false" multiValued="false" />
<field name="organizationalunitdescription_s" type="text_de" indexed="false" stored="true" required="false" multiValued="false" />
<field name="company_s" type="text_de" indexed="true" stored="true" required="false" multiValued="false" />
<field name="street_s" type="text_de" indexed="true" stored="true" required="false" multiValued="false" />
<field name="streetnumber_s" type="int" indexed="true" stored="true" required="false" multiValued="false" />
<field name="postcode_s" type="int" indexed="true" stored="true" required="false" multiValued="false" />
<field name="city_s" type="text_de" indexed="true" stored="true" required="false" multiValued="false" />
<field name="building_s" type="text_de" indexed="true" stored="true" required="false" multiValued="false" />
<field name="roomnumber_s" type="int" indexed="true" stored="true" required="false" multiValued="false" />
<field name="country_s" type="text_en" indexed="true" stored="true" required="true" multiValued="false" />
<field name="countrycode_s" type="string" indexed="true" stored="true" required="true" multiValued="false" />
<field name="emailaddress_s" type="string" indexed="true" stored="true" required="false" multiValued="false" />
<field name="phone1_s" type="string" indexed="true" stored="true" required="false" multiValued="false" />
<field name="phone2_s" type="string" indexed="true" stored="true" required="false" multiValued="false" />
<field name="mobile_s" type="string" indexed="true" stored="true" required="false" multiValued="false" />
<field name="fax_s" type="string" indexed="true" stored="true" required="false" multiValued="false" />

I am populating the database by pushing about 20,000 random test datasets like the following to post.jar:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<add>
    <doc>
        <field name="id">1352498443_1</field>
        <field name="givenname_s">Aynur</field>
        <field name="middleinitial_s"/>
        <field name="surname_s">Lehnen</field>
        <field name="gender_s">F</field>
        <field name="pictureuri_s">dummy_assets/female.jpg</field>
        <field name="function_s">Zugschaffner/in</field>
        <field name="organizationalunit_s">P 07</field>
        <field name="organizationalunitdescription_s">Lorem Ipsum sadipscing voluptua ipsum invidunt dolor et dolore invidunt sed consetetur accusam dolore Lorem tempor.</field>
        <field name="company_s">Lorem Lagna Epsum Emet</field>
        <field name="street_s">Erlenweg</field>
        <field name="streetnumber_s">82</field>
        <field name="postcode_s">76297</field>
        <field name="city_s">Lübeck</field>
        <field name="building_s"/>
        <field name="roomnumber_s">242</field>
        <field name="country_s">GERMANY</field>
        <field name="countrycode_s">DE</field>
        <field name="emailaddress_s">[email protected]</field>
        <field name="phone1_s">0392984823</field>
        <field name="phone2_s">0124111417</field>
        <field name="mobile_s">0325117132</field>
        <field name="fax_s">0171459177</field>
    </doc>
</add>

However, when retrieving data I seem to have problems with alphabetical sorting. Consider the following query and its response:

{
    "responseHeader": {
        "status": 0,
        "QTime": 5,
        "params": {
            "sort": "surname_s asc",
            "fl": "surname_s",
            "indent": "true",
            "wt": "json",
            "q": "city_s:berlin"
        }
    },
    "response": {
        "numFound": 1094,
        "start": 0,
        "docs": [{
            "surname_s": "Weil"
        }, {
            "surname_s": "Abel"
        }, {
            "surname_s": "Adam"
        }, {
            "surname_s": "Ade"
        }, {
            "surname_s": "Adrian"
        }, {
            "surname_s": "Aigner"
        }, {
            "surname_s": "Aigner"
        }, {
            "surname_s": "Alber"
        }, {
            "surname_s": "Alber"
        }, {
            "surname_s": "Albers"
        }]
    }
}

Why is "Weil" on position one, while the rest of the data appears to be sorted correctly?

Upvotes: 9

Views: 9018

Answers (3)

Maarten Kieft

Reputation: 7125

I had similar issues and tried alphaOnlySort. It works in part, but it starts messing up the sort results when the field contains values with characters like -, / or spaces.

So the result was something like

  1. / abc
  2. aa
  3. / abc2

So I ended up using the field type lowercase. It was already there, so I figured it's a default type. I used the copyField construction, so my final config was:

<schema>
    <fieldType name="lowercase" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory" />
      </analyzer>
    </fieldType>
    <fields>
       <field name="job_name_sort" type="lowercase" indexed="true" stored="false" required="false"/>
    </fields>
    <copyField source="job_name" dest="job_name_sort"/>
</schema>

Upvotes: 0

Jayendra

Reputation: 52769

Sorting doesn't work well on multivalued and tokenized fields.

Documentation -
Sorting can be done on the "score" of the document, or on any multiValued="false" indexed="true" field provided that field is either non-tokenized (ie: has no Analyzer) or uses an Analyzer that only produces a single Term (ie: uses the KeywordTokenizer)

Use string as the field type and copy the surname_s field into the new field.

<field name="surname_s_sort" type="string" indexed="true" stored="false"/>

<copyField source="surname_s" dest="surname_s_sort" />  

As @Paige answered, you can also use a KeywordTokenizer with a lowercase filter, which keeps the whole value as a single token.
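The single-term requirement quoted above can be illustrated with a rough sketch in plain Python (not Solr itself). It assumes, without this having been verified in the thread, that the text_de analyzer drops German stop words and that "weil" ("because") is on that list. Under that assumption the surname "Weil" would end up with no indexed term at all, which is one plausible reason for it sorting ahead of everything else. The stop-word set and helper names below are hypothetical.

```python
import re

# Assumption: a tiny stand-in for a German stop-word list; "weil" is assumed to be on it.
ASSUMED_STOPWORDS = {"weil", "und", "der"}


def tokenized_terms(value):
    """Rough stand-in for a tokenizing text analyzer: lowercase, split, drop stop words."""
    tokens = re.findall(r"\w+", value.lower())
    return [t for t in tokens if t not in ASSUMED_STOPWORDS]


def tokenized_sort_key(value):
    """Sort on the first surviving term; a value with no terms sorts before everything."""
    terms = tokenized_terms(value)
    return terms[0] if terms else ""


def keyword_sort_key(value):
    """Stand-in for KeywordTokenizer + lowercase: the whole value is one term."""
    return value.lower()


surnames = ["Abel", "Weil", "Adam"]
sorted_tokenized = sorted(surnames, key=tokenized_sort_key)
sorted_keyword = sorted(surnames, key=keyword_sort_key)

print(sorted_tokenized)
print(sorted_keyword)
```

With a single-term key such as the string copy field, every document keeps exactly one sortable value, so this kind of gap cannot occur.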

Upvotes: 4

Paige Cook

Reputation: 22555

I believe that some of the additional analyzers applied by the text_de field type are the cause of this sorting behavior. In my experience, the best results when sorting strings come from the alphaOnlySort fieldType that ships with the example schema.xml, shown below.

<fieldType name="alphaOnlySort" class="solr.TextField" sortMissingLast="true" omitNorms="true">
  <analyzer>
    <!-- KeywordTokenizer does no actual tokenizing, so the entire
         input string is preserved as a single token
      -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <!-- The LowerCase TokenFilter does what you expect, which can be
         useful when you want your sorting to be case insensitive
      -->
    <filter class="solr.LowerCaseFilterFactory" />
    <!-- The TrimFilter removes any leading or trailing whitespace -->
    <filter class="solr.TrimFilterFactory" />
    <!-- The PatternReplaceFilter gives you the flexibility to use
         Java Regular expression to replace any sequence of characters
         matching a pattern with an arbitrary replacement string, 
         which may include back references to portions of the original
         string matched by the pattern.

         See the Java Regular Expression documentation for more
         information on pattern and replacement string syntax.

         http://java.sun.com/j2se/1.6.0/docs/api/java/util/regex/package-summary.html
      -->
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="([^a-z])" replacement="" replace="all"
    />
  </analyzer>
</fieldType>
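To get a feel for what this chain does to a value, here is a rough approximation in plain Python. It is only a sketch of the three filters (lowercase, trim, strip everything outside a-z), not Solr code, and the function name and sample values are made up for illustration.

```python
import re


def alpha_only_sort_key(value):
    """Approximate the alphaOnlySort chain on a single, untokenized value."""
    lowered = value.lower()                  # LowerCaseFilterFactory
    trimmed = lowered.strip()                # TrimFilterFactory
    return re.sub(r"[^a-z]", "", trimmed)    # PatternReplaceFilterFactory, pattern ([^a-z])


examples = ["Müller", "von Braun", "/ abc2", "O'Neill"]
keys = {value: alpha_only_sort_key(value) for value in examples}

for value, key in keys.items():
    print(repr(value), "->", repr(key))
```

Note that the pattern removes every character outside a-z, so digits, punctuation, spaces, and also umlauts disappear from the sort key. For German surnames that may or may not be the ordering you want, so it is worth checking against your own data.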

I would recommend creating a new field and then copying the value from surname_s via copyField, something like the following:

 <field name="surname_s_sort" type="alphaOnlySort" indexed="true" stored="false" required="false" multiValued="false" />

 <copyField source="surname_s" dest="surname_s_sort"/>

Note: there is no need to store the value in the surname_s_sort field, hence the stored="false" attribute, unless you expect to display it to users.

Then you can just change your query to sort on the surname_s_sort instead.
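As a minimal sketch, the query from the question with only the sort parameter changed could be built like this. The base URL (host, port, and the collection1 core name) is an assumption based on a default example install, so adjust it to your setup. Remember that the copyField only takes effect for documents indexed after the schema change, so the data needs to be re-indexed first.

```python
from urllib.parse import urlencode

# Assumption: default example host, port, and core name; change to match your install.
BASE_URL = "http://localhost:8983/solr/collection1/select"

params = {
    "q": "city_s:berlin",
    "sort": "surname_s_sort asc",  # sort on the copy field instead of surname_s
    "fl": "surname_s",             # still return the original, stored field
    "wt": "json",
    "indent": "true",
}

query_url = BASE_URL + "?" + urlencode(params)
print(query_url)
```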

Upvotes: 14
