Gayolomao

Reputation: 596

Solr not detecting language automatically

I've set up a single-core Solr (4.6.0) instance and I'm trying to index documents in multiple languages. I configured Solr to auto-detect the document language, but it always assigns the default language (the one configured in the langid.fallback parameter).

This is what I wrote in solrconfig.xml to allow language detection:

<requestHandler name="/update" class="solr.UpdateRequestHandler">
  <lst name="defaults">
    <str name="update.chain">langid</str>
  </lst>
</requestHandler>

and

<updateRequestProcessorChain name="langid">
  <processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
    <str name="langid.fl">text,title,description,content</str>
    <str name="langid.langField">language_s</str>
    <str name="langid.fallback">en</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

After uploading a document, this is what appears in the log:

248638 [qtp723484867-14] DEBUG org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor  – LangId configured
248639 [qtp723484867-14] DEBUG org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor  – Language fallback to value en
248639 [qtp723484867-14] DEBUG org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor  – Appending field text
248639 [qtp723484867-14] DEBUG org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor  – Appending field title
248639 [qtp723484867-14] WARN  org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor  – Field title not a String value, not including in detection
248640 [qtp723484867-14] DEBUG org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor  – Appending field description
248640 [qtp723484867-14] WARN  org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor  – Field description not a String value, not including in detection
248640 [qtp723484867-14] DEBUG org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor  – Appending field content
248640 [qtp723484867-14] DEBUG org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor  – No input text to detect language from, returning empty list
248641 [qtp723484867-14] DEBUG org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor  – No language detected, using fallback en
248641 [qtp723484867-14] DEBUG org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor  – Detected main document language from fields [Ljava.lang.String;@6efbb783: en

From my understanding of the log, LanguageIdentifierUpdateProcessor can't process solr.TextField fields for language detection, but I haven't seen this restriction in any documentation. Furthermore, I've seen a couple of examples in books, and both use text fields (not String fields) for language detection. Also, I don't know why, but the text and content fields are not taken into account.

Can anybody point me in the right direction?

Here are the definitions of those fields:

<field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/>
<field name="title" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="description" type="text_general" indexed="true" stored="true"/>
<field name="content" type="text_general" indexed="false" stored="true" multiValued="true"/>

Thanks!

Upvotes: 2

Views: 2230

Answers (4)

Hybos

Reputation: 156

When updating, you should use

/update?update.chain=langid

and, if it's properly configured, it will work.
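As a rough illustration of passing the chain on the request itself, here is a minimal Python sketch using only the standard library. The core URL (collection1) and the sample document are assumptions for illustration, and build_update_request and send are hypothetical helper names; it also assumes the /update handler accepts a JSON array of documents.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_update_request(solr_url, docs, chain="langid"):
    """Build a POST to <solr_url>/update that names the update chain explicitly."""
    # update.chain selects the processor chain for this request only;
    # commit=true makes the documents searchable right away.
    params = urlencode({"update.chain": chain, "commit": "true"})
    body = json.dumps(docs).encode("utf-8")
    return Request(
        f"{solr_url}/update?{params}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=10):
    """Send the request to a running Solr instance and return the HTTP status."""
    with urlopen(request, timeout=timeout) as resp:
        return resp.status
```

Calling send(build_update_request("http://localhost:8983/solr/collection1", [{"id": "1", "content": "Ceci est un texte"}])) would post one document through the langid chain, provided a core exists at that (assumed) URL.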

Upvotes: 0

zorze

Reputation: 198

In Solr 7.1 onwards:

1) Uncomment the <updateRequestProcessorChain name="langid"> section and set whichever other parameters you need.

2) Add the langid entry (the update.chain line) to

  <initParams path="/update/**,/query,/select,/tvrh,/elevate,/spell,/browse">
    <lst name="defaults">
      <str name="df">_text_</str>
      <str name="update.chain">langid</str>

    </lst>
  </initParams>

3) Restart Solr and use the standard pysolr client, as in:

import pysolr

# dataTFText is the document to index, built elsewhere.
solrTargetCollection = pysolr.Solr('http://localhost:8983/solr/LangCollection', timeout=10)
solrTargetCollection.add([dataTFText])
solrTargetCollection.commit()

Upvotes: 1

Magic_Cindy

Reputation: 3

I'm using 6.1.0. In this version /update actually works, and /update/extract doesn't work anymore.

<requestHandler name="/update" class="solr.UpdateRequestHandler">
  <lst name="defaults">
    <str name="update.chain">langid</str>
  </lst>
</requestHandler>

Upvotes: 0

Gayolomao

Reputation: 596

I got it working by calling /update/extract instead.

In solrconfig.xml:

<!-- Solr Cell Update Request Handler
     http://wiki.apache.org/solr/ExtractingRequestHandler 
-->
<requestHandler name="/update/extract" 
                startup="lazy"
                class="solr.extraction.ExtractingRequestHandler" >
  <lst name="defaults">
    <str name="lowernames">true</str>
    <str name="uprefix">ignored_</str>

    <!-- capture link hrefs but ignore div attributes -->
    <str name="captureAttr">true</str>
    <str name="fmap.a">ignored_</str>
    <str name="fmap.div">ignored_</str>

    <str name="update.chain">langid</str>
  </lst>
</requestHandler>

In the Java code:

  // Upload pdf content
  ContentStreamUpdateRequest up = new ContentStreamUpdateRequest("/update/extract");
  up.setParam("literal.id", doc.getId().toString());
  up.setParam("literal.title", doc.getTitle());
  up.setParam("literal.description", doc.getDescription());
  up.addFile(new java.io.File(doc.getFile().getFilePath()), doc.getProcessedFile().getFile()
      .getMimeType());
  up.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
  solrServer.getServer().request(up);

This way, the document language is detected correctly.
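One way to check the result is to facet on the language field and see which values were assigned. The following is a minimal Python sketch, not part of the original setup: the helper names are hypothetical, language_s comes from langid.langField in the question, and the parsing assumes Solr's default JSON facet layout, a flat list of alternating values and counts.

```python
from urllib.parse import urlencode


def build_language_facet_url(solr_url, lang_field="language_s"):
    """Build a /select URL that returns only facet counts for the language field."""
    params = urlencode({
        "q": "*:*",          # match every document
        "rows": 0,           # no documents needed, only the counts
        "facet": "true",
        "facet.field": lang_field,
        "wt": "json",
    })
    return f"{solr_url}/select?{params}"


def parse_language_counts(response, lang_field="language_s"):
    """Turn the flat facet list [value, count, value, count, ...] into a dict."""
    flat = response["facet_counts"]["facet_fields"][lang_field]
    return dict(zip(flat[0::2], flat[1::2]))
```

If every document ends up under the fallback value (en in the question's configuration), detection is most likely still being skipped for the fields listed in langid.fl.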

Hope it helps someone!

Upvotes: 4
