Reputation:
I have an Alfresco model type with an additional property of type d:content. This property causes Solr exceptions when I try to store content larger than 32 KB in it. The current definition of this property is:
<property name="acme:secondContent">
    <type>d:content</type>
    <mandatory>false</mandatory>
    <index enabled="true">
        <atomic>true</atomic>
        <stored>true</stored>
        <tokenised>both</tokenised>
    </index>
</property>
If I put content larger than 32 KB into this property, Solr throws this exception when it tries to index it:
java.lang.IllegalArgumentException: Document contains at least one immense term in field="content@s____@{http://acme.com/model/custom/1.0}secondContent" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped. Please correct the analyzer to not produce such terms.
Changing the index configuration does not help; the error is thrown with all variants of index and its sub-elements that I've tried.
An answer to another question states:
The maximum size for a single term in the underlying Lucene index is 32766 bytes, which is, I believe, hard-coded.
How do I configure the index of a d:content property so that I can save and index content larger than 32 KB?
Edit:
In contentModel.xml, cm:content is configured like this:
<index enabled="true">
    <atomic>true</atomic>
    <stored>false</stored>
    <tokenised>true</tokenised>
</index>
Adding a simple text/plain file with content larger than 32 KB works without problems. The same index configuration for my custom property still fails.
Update:
Under Alfresco 4.2f CE, the problem does not occur. So this is a bug in Alfresco 5.0c in combination with Solr 4.1.9.
Update 2:
I've filed a bug in the Alfresco JIRA.
Upvotes: 7
Views: 1069
Reputation: 2737
The solution is not to store the full document/property value in the index, so avoid stored=true and tokenised=both (or false) on large properties whose content can exceed 32 KB. Indexing should work if your model declaration looks like this:
<property name="acme:secondContent">
    <type>d:content</type>
    <mandatory>false</mandatory>
    <index enabled="true">
        <atomic>true</atomic>
        <stored>false</stored>
        <tokenised>true</tokenised>
    </index>
</property>
Drawback: in my test I had to drop the whole index; it was not sufficient to delete the cached models in Solr.
Upvotes: 2
Reputation: 466
Hypothesis 1
If your content really contains such very long terms (single words up to 32 KB in length), you have to implement your own Lucene analyzer to support that specific type of text, as sketched below. This limit comes from the default Lucene implementation, where the maximum term length is hard-coded.
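As an illustration, here is a minimal sketch of such an analyzer that simply drops oversized tokens instead of letting Lucene reject the whole document. It is written against the Lucene 5+ analyzer API (constructor signatures differ slightly in the Lucene 4.x bundled with Solr 4), and the class name and the 8000-character cut-off are assumptions for this example:
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.miscellaneous.LengthFilter;

// Sketch: skip "immense" terms so the rest of the document still gets indexed.
public class MaxTermLengthAnalyzer extends Analyzer {
    // Stay well below Lucene's hard limit of 32766 bytes per term;
    // a UTF-8 character can take up to 4 bytes, so 8000 characters
    // is a comfortable margin (assumption for this sketch).
    private static final int MAX_TERM_CHARS = 8000;

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new WhitespaceTokenizer();
        // LengthFilter silently discards tokens outside [min, max],
        // so oversized terms are skipped rather than rejected.
        TokenStream filtered = new LengthFilter(source, 1, MAX_TERM_CHARS);
        return new TokenStreamComponents(source, filtered);
    }
}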
Hypothesis 2
Otherwise, if your content is not structured in the way described above, it sounds strange to me and could well be a bug. If tokenised=true does not solve it, a potential workaround could be to change the content model: add an association between the parent node and a dedicated node that holds the involved text in the default cm:content property (see the sketch below). I mean, using associations you should solve it ;)
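A minimal sketch of such a model change, assuming hypothetical type and association names (acme:parentType, acme:secondContentAssoc), could look like this:
<type name="acme:parentType">
    <parent>cm:content</parent>
    <associations>
        <!-- Hypothetical child association pointing to a plain cm:content
             node that carries the large text; as noted in the question,
             cm:content indexes content larger than 32 KB without problems. -->
        <child-association name="acme:secondContentAssoc">
            <source>
                <mandatory>false</mandatory>
                <many>false</many>
            </source>
            <target>
                <class>cm:content</class>
                <mandatory>false</mandatory>
                <many>false</many>
            </target>
        </child-association>
    </associations>
</type>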
Hope this helps.
Upvotes: 5