anand

Reputation: 1751

Getting Error in Lucene: Token period exceeds length of provided text sized 2457

First of all, I am a newbie to Apache Lucene. I am getting the error below while adding documents to the Lucene index.

org.apache.lucene.search.highlight.InvalidTokenOffsetsException: Token period exceeds length of provided text sized 2457

First, I am unable to understand what this error means, and second, why I am getting it when I add the files individually to the Lucene index.

I have two use cases:

Case 1: I create the index for all the documents in a particular directory at once. That works fine, with no problems whatsoever.

Case 2: I provide an upload button so that the user can upload documents to that same directory. After the upload, I call the same program as above, with small tweaks, to index the uploaded document. The first upload works fine, but from the second upload onwards it throws the error mentioned above.

The following snippets give an idea of what I am doing:

final String RICH_DOCUMENT_PATH = "F:\\Sample\\SampleRichDocuments";
final String INDEX_DIRECTORY = "F:\\Sample\\LuceneIndexer";

List<ContentHandler> contentHandlerList = new ArrayList<ContentHandler>();

FieldType fieldType = new FieldType();
fieldType.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
fieldType.setStoreTermVectors(true);
fieldType.setStoreTermVectorPositions(true);
fieldType.setStoreTermVectorPayloads(true);
fieldType.setStoreTermVectorOffsets(true);
fieldType.setStored(true);

// Parsing the documents using Apache Tika
for (File file : new File(RICH_DOCUMENT_PATH).listFiles()) {
    Metadata metadata = new Metadata();

    ContentHandler handler = new BodyContentHandler(-1);
    ParseContext context = new ParseContext();
    Parser parser = new AutoDetectParser();
    InputStream stream = new FileInputStream(file);

    try {
        parser.parse(stream, handler, metadata, context);
        contentHandlerList.add(handler);
    } catch (TikaException e) {
        e.printStackTrace();
    } catch (SAXException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    } finally {
        try {
            stream.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

ArrayList<File> fileList = new ArrayList<File>();
for(File file : new File(RICH_DOCUMENT_PATH).listFiles()){
    fileList.add(file);
}

long startTime = System.currentTimeMillis();

Analyzer analyzer = new StandardAnalyzer();
Directory directory = FSDirectory.open(new File(INDEX_DIRECTORY).toPath());
IndexWriterConfig conf = new IndexWriterConfig(analyzer);
IndexWriter writer = new IndexWriter(directory, conf);

writer.deleteAll();

Iterator<ContentHandler> handlerIterator = contentHandlerList.iterator();
Iterator<File> fileIterator = fileList.iterator();

Date date = new Date();
int i = 0;

while (handlerIterator.hasNext() && fileIterator.hasNext()) {
    Document doc = new Document();
    i++;

    String text = handlerIterator.next().toString();
    String textFileName = fileIterator.next().getName();

    Field idField = new Field("document_id", String.valueOf(i), fieldType);
    Field fileNameField = new Field("file_name", textFileName, fieldType);
    Field contentField = new Field("text", text, fieldType);

    doc.add(idField);
    doc.add(contentField);
    doc.add(fileNameField);

    writer.addDocument(doc);
}

writer.commit();
writer.deleteUnusedFiles();
long endTime = System.currentTimeMillis();

I searched for the above error but was unable to find a solution.

My updated code with the highlighter is:

BooleanQuery.Builder booleanQuery = null;
Query textQuery = null;
Query fileNameQuery = null;

try {
    textQuery = new QueryParser("content", new StandardAnalyzer()).parse(searchText);
    fileNameQuery = new QueryParser("title", new StandardAnalyzer()).parse(searchText);
    booleanQuery = new BooleanQuery.Builder();
    booleanQuery.add(textQuery, BooleanClause.Occur.SHOULD);
    booleanQuery.add(fileNameQuery, BooleanClause.Occur.SHOULD);
} catch (ParseException e) {
    e.printStackTrace();
}

int hitsPerPage = 10;

IndexReader reader = DirectoryReader.open(index);
IndexSearcher searcher = new IndexSearcher(reader);
TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage);
searcher.search(booleanQuery.build(), collector);

SimpleHTMLFormatter htmlFormatter = new SimpleHTMLFormatter();
Highlighter highlighter = new Highlighter(htmlFormatter, new QueryScorer(booleanQuery.build()));

totalHits = collector.getTotalHits();

ScoreDoc[] hits = collector.topDocs().scoreDocs;
Document doc = null;

for (ScoreDoc hit : hits) {
    doc = reader.document(hit.doc);
    Document docID = searcher.doc(Integer.parseInt(doc.get("document_id"))-1);

    String docText = doc.get("text");
    String searchedSnippet = "";

    TokenStream tokenStream = TokenSources.getTokenStream(docID, "text", new StandardAnalyzer());
    TextFragment[] frag = highlighter.getBestTextFragments(tokenStream, docText, false, 4);

    for (int j = 0; j < frag.length; j++) {
        if ((frag[j] != null) && (frag[j].getScore() > 0)) {
            System.out.println(frag[j].toString());
            searchedSnippet += frag[j].toString();
        }
    }

    System.out.println("Hit on the Docs : "+hit.doc);
    System.out.println("FileName is "+doc.get("file_name"));    
}

reader.close();
index.close();

Upvotes: 2

Views: 1035

Answers (1)

femtoRgon

Reputation: 33341

The problem is that you are passing two different documents into the getBestTextFragments call.

I don't know exactly what you are intending to accomplish here, but the root of your problem is this:

// First document: the actual search hit
doc = reader.document(hit.doc);
// Second document: looks like some sort of join?
Document docID = searcher.doc(Integer.parseInt(doc.get("document_id")) - 1);

You then proceed to get docText, the "text" of the first document, and tokenStream, a TokenStream over the "text" of the second document. The tokenStream and the string to be highlighted should represent the same content. The issue you are running into is akin to an IndexOutOfBoundsException: the highlighter has found something to highlight in the TokenStream, but the corresponding location is past the end of the string to be highlighted. Like this:

doc2 stream: This|is|document|2|and|its|longer|and|we|found|a|match|here|too|far|out
                                                             ↕          ↕
doc1 string: Can't highlight past the end of this string.
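
If what you want is to highlight the document that was actually hit, one fix is to take both the stored text and the token stream from that single document. Here is a minimal sketch along those lines, reusing the question's "text" field and highlighter setup (hitDoc and frags are just illustrative names):

for (ScoreDoc hit : hits) {
    // Load the document that was actually hit
    Document hitDoc = searcher.doc(hit.doc);

    // Stored text and token stream now come from the SAME document,
    // so the token offsets line up with the string being highlighted
    String docText = hitDoc.get("text");
    TokenStream tokenStream = TokenSources.getTokenStream(hitDoc, "text", new StandardAnalyzer());

    TextFragment[] frags = highlighter.getBestTextFragments(tokenStream, docText, false, 4);
    for (TextFragment frag : frags) {
        if (frag != null && frag.getScore() > 0) {
            System.out.println(frag.toString());
        }
    }
}

If you really do need the joined document (the lookup via "document_id"), then take both docText and the token stream from that document instead. Either way, the two arguments you pass to getBestTextFragments have to describe the same content.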

Upvotes: 2
