torrybr

Reputation: 23

Elastic Architecture: Full-Text Searching on 1 Million Files' Content

Summary

I am trying to design an Elastic index (or set of indices) that will provide a solid foundation for indexing 1,000,000+ files and full-text searching on their contents. New files will be continuously added after the initial digitization process.

Use Case

Various file types (PDF, Outlook email, MP3, TXT, JPEG of handwritten material, etc.) need to be searchable by their contents and metadata.

  - Users want to manually tag relationships between documents, e.g. Document A -> contains information about -> Document B.
  - Users want to be able to see related/similar texts.
  - Users want Named Entity Recognition (NER) on the text contents.

The physical files are already stored on an external machine, waiting to be processed.

Implementation

  1. File content extraction pipeline using Apache Tika
  2. NER using spaCy
  3. Upload file contents + NER tags to Elastic (a rough end-to-end sketch follows this list)
  4. Eventually, run our own search models for better search insights and data science.
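For reference, here is a minimal sketch of steps 1-3. It assumes a local Tika server (the `tika` pip package starts one on demand), a spaCy model named "en_core_web_sm" as a stand-in for your trained NER model, the elasticsearch-py 8.x client talking to a cluster on localhost, and a hypothetical index name "documents" and source directory; all of these are placeholders, not a prescribed setup.

```python
# Sketch only: extraction -> NER -> indexing. Index name, paths, and model
# name are hypothetical placeholders.
from pathlib import Path

import spacy
from elasticsearch import Elasticsearch
from tika import parser

nlp = spacy.load("en_core_web_sm")            # swap in your trained NER model
es = Elasticsearch("http://localhost:9200")   # assumes elasticsearch-py 8.x

def process_file(path: Path) -> dict:
    """Extract text + metadata with Tika and tag entities with spaCy."""
    parsed = parser.from_file(str(path))
    content = parsed.get("content") or ""
    doc = nlp(content)
    return {
        "filename": path.name,
        "content": content,
        "metadata": parsed.get("metadata", {}),
        "entities": [{"text": ent.text, "label": ent.label_} for ent in doc.ents],
    }

for path in Path("/data/files").rglob("*"):   # placeholder source directory
    if path.is_file():
        es.index(index="documents", document=process_file(path))
```

In practice you would batch the indexing calls (e.g. with the bulk helpers) and run extraction/NER as an offline pipeline, but the shape of each document stays the same.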

How do I best store the extracted contents to fit the users' needs and provide a scalable foundation? Is it better to run our trained Named Entity Recognition model at initial indexing time, or after the extracted text has already been uploaded to Elastic?
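One way to store this is a single index whose documents carry the full text, the NER output, and the user-tagged relationships side by side. The mapping below is only a hypothetical sketch using the elasticsearch-py 8.x client; the field names ("content", "entities", "related_docs", etc.) are illustrative, not a prescribed schema.

```python
# Hypothetical mapping sketch: all field names are placeholders.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.indices.create(
    index="documents",
    mappings={
        "properties": {
            "filename":  {"type": "keyword"},
            "file_type": {"type": "keyword"},
            "content":   {"type": "text"},        # full-text search target
            "metadata":  {"type": "object"},      # Tika metadata, stored as-is
            # NER output stored alongside the text so it can be filtered on
            "entities": {
                "type": "nested",
                "properties": {
                    "text":  {"type": "keyword"},
                    "label": {"type": "keyword"},
                },
            },
            # user-tagged relationships: target document id + relation type
            "related_docs": {
                "type": "nested",
                "properties": {
                    "doc_id":   {"type": "keyword"},
                    "relation": {"type": "keyword"},
                },
            },
        }
    },
)
```

Under this kind of layout, running NER before indexing (so entities arrive with the document) is usually simpler than updating documents afterwards, since re-running the model later means reindexing or partial updates either way.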

Or does it make more sense to use one of the existing solutions below rather than reinvent the wheel?

https://github.com/dadoonet/fscrawler

https://github.com/deepset-ai/haystack

https://github.com/RD17/ambar

Upvotes: 1

Views: 518

Answers (1)

invider

Reputation: 353

Instead of reinventing the wheel, I'd recommend using an existing solution such as Jina; there's a working example of PDF search built with Jina. You can also search across different modalities (text, image, PDF, etc.) with it.

Upvotes: 1
