Reputation: 129
I am trying to index a large document with Lucene (added as a Maven dependency), but some content exceeds Lucene's maximum term length, and I receive this error:
Caused by: java.lang.IllegalArgumentException: Document contains at least one immense term in field="content" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped. Please correct the analyzer to not produce such terms. The prefix of the first immense term is: '[65, 32, 98, 101, 110, 122, 111, 100, 105, 97, 122, 101, 112, 105, 110, 101, 32, 91, 116, 112, 108, 93, 73, 80, 65, 99, 45, 101, 110, 124]...', original message: bytes can be at most 32766 in length; got 85391
The code is below. It is copied from http://lucenetutorial.com/lucene-in-5-minutes.html with a slight change so that the document is read from a file:
File file = new File("doc.txt");
StandardAnalyzer analyzer = new StandardAnalyzer();
Directory index = new RAMDirectory();
IndexWriterConfig config = new IndexWriterConfig(analyzer);
IndexWriter w = new IndexWriter(index, config);
Scanner scanner = new Scanner(file);
while (scanner.hasNextLine())
{
    String line = scanner.nextLine();
    Document doc = new Document();
    doc.add(new StringField("content", line, Field.Store.YES));
    w.addDocument(doc);
}
scanner.close();
...
There are other posts about this same error, but their solutions are for Solr or Elasticsearch, not for plain Lucene, so I am not sure how to apply them here.
Can anyone point me in the right direction to solve this issue, please?
Thank you in advance.
Upvotes: 0
Views: 337
Reputation: 1343
If you want to index a text rather than single words, you should use something that breaks your text down into words, such as a WhitespaceAnalyzer.
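The key detail is the field type: a StringField is indexed as one single token and bypasses the analyzer entirely, so a long line becomes one immense term. An analyzed field type such as TextField lets the configured analyzer (StandardAnalyzer, or a WhitespaceAnalyzer as suggested) split each line into word-sized terms, each well under the 32766-byte limit. A minimal sketch, adapting the snippet from the question (the literal text is a placeholder):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

public class IndexLongText {
    public static void main(String[] args) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer();
        Directory index = new RAMDirectory();
        IndexWriter w = new IndexWriter(index, new IndexWriterConfig(analyzer));

        Document doc = new Document();
        // TextField content is run through the analyzer, so the line is
        // tokenized into individual terms instead of being indexed as
        // one immense, un-analyzed term (which is what StringField does).
        doc.add(new TextField("content", "a very long line of text ...", Field.Store.YES));
        w.addDocument(doc);
        w.close();
    }
}
```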
Upvotes: 1