torrybr

Reputation: 23

Elastic Architecture: Full-Text Searching on 1 Million Files' Content

Summary

I am trying to design an Elastic index (or set of indices) that will provide a solid foundation for indexing 1,000,000+ files and full-text searching on their contents. New files will be continuously added after the initial digitization process.

Use Case

Various file types (PDF, Outlook email, MP3, TXT, JPEG of handwritten material, etc.) need to be searchable by their contents and metadata.

  - Users want to manually tag relationships between documents, e.g. Document A -> contains information about -> Document B.
  - Users want to be able to see related/similar texts.
  - Users want Named Entity Recognition (NER) on the text contents.

The physical files are already stored on an external machine, waiting to be processed.

Implementation

  1. File content extraction pipeline using Apache Tika
  2. NER using spaCy
  3. Upload file contents + NER tags to Elastic (a rough end-to-end sketch follows this list)
  4. Eventually, run our own search models for better search insights and data science.
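For reference, here is a minimal sketch of steps 1-3. It assumes a local Tika server (the `tika` pip package starts one on demand), a spaCy model named "en_core_web_sm" as a stand-in for your trained NER model, the elasticsearch-py 8.x client talking to a cluster on localhost, and a hypothetical index name "documents" and source directory; all of these are placeholders, not a prescribed setup.

```python
# Sketch only: extraction -> NER -> indexing. Index name, paths, and model
# name are hypothetical placeholders.
from pathlib import Path

import spacy
from elasticsearch import Elasticsearch
from tika import parser

nlp = spacy.load("en_core_web_sm")            # swap in your trained NER model
es = Elasticsearch("http://localhost:9200")   # assumes elasticsearch-py 8.x

def process_file(path: Path) -> dict:
    """Extract text + metadata with Tika and tag entities with spaCy."""
    parsed = parser.from_file(str(path))
    content = parsed.get("content") or ""
    doc = nlp(content)
    return {
        "filename": path.name,
        "content": content,
        "metadata": parsed.get("metadata", {}),
        "entities": [{"text": ent.text, "label": ent.label_} for ent in doc.ents],
    }

for path in Path("/data/files").rglob("*"):   # placeholder source directory
    if path.is_file():
        es.index(index="documents", document=process_file(path))
```

In practice you would batch the indexing calls (e.g. with the bulk helpers) and run extraction/NER as an offline pipeline, but the shape of each document stays the same.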

How do I best store the extracted contents to fit the users' needs and provide a scalable foundation? Is it better to run our trained Named Entity Recognition model at initial indexing time, or after the extracted text has already been uploaded to Elastic?
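One way to store this is a single index whose documents carry the full text, the NER output, and the user-tagged relationships side by side. The mapping below is only a hypothetical sketch using the elasticsearch-py 8.x client; the field names ("content", "entities", "related_docs", etc.) are illustrative, not a prescribed schema.

```python
# Hypothetical mapping sketch: all field names are placeholders.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.indices.create(
    index="documents",
    mappings={
        "properties": {
            "filename":  {"type": "keyword"},
            "file_type": {"type": "keyword"},
            "content":   {"type": "text"},        # full-text search target
            "metadata":  {"type": "object"},      # Tika metadata, stored as-is
            # NER output stored alongside the text so it can be filtered on
            "entities": {
                "type": "nested",
                "properties": {
                    "text":  {"type": "keyword"},
                    "label": {"type": "keyword"},
                },
            },
            # user-tagged relationships: target document id + relation type
            "related_docs": {
                "type": "nested",
                "properties": {
                    "doc_id":   {"type": "keyword"},
                    "relation": {"type": "keyword"},
                },
            },
        }
    },
)
```

Under this kind of layout, running NER before indexing (so entities arrive with the document) is usually simpler than updating documents afterwards, since re-running the model later means reindexing or partial updates either way.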

Or does it make more sense to use one of the existing solutions below rather than reinvent the wheel?

https://github.com/dadoonet/fscrawler

https://github.com/deepset-ai/haystack

https://github.com/RD17/ambar

Upvotes: 1

Views: 518

Answers (1)

invider

Reputation: 353

Instead of reinventing the wheel, I'd recommend using an existing solution such as Jina; there's a working example of PDF search built with Jina. You can also search across different modalities (text, image, PDF, etc.) with it.

Upvotes: 1
