Shekar Patel

Reputation: 231

Integration of Solr with EMC Documentum

We have a bunch of PDF documents stored in EMC Documentum. We need to integrate Apache Solr with Documentum so that we can search for a specific document in Solr and then retrieve it from Documentum.

I looked at the link below, but it does not give enough information: https://community.emc.com/docs/DOC-6520

Any help is really appreciated.

Upvotes: 0

Views: 714

Answers (2)

Mohamad Ibrahim

Reputation: 5565

I have built my own connector to extract data from Documentum and insert it into Elasticsearch or Solr, and I am willing to share it. Please contact me.

Upvotes: 0

cheffe

Reputation: 9500

The link you have posted would get you a working solution. The author proposes writing a custom crawler that connects to the Documentum repository and then uses Apache Tika to perform the content extraction for Solr.

However, I would suggest using:

  • Apache ManifoldCF to act as the crawler that gets the content from Documentum to Solr. You should not write this by hand, as it has already been done and tested.

    Apache ManifoldCF is an effort to provide an open source framework for connecting source content repositories like Microsoft Sharepoint and EMC Documentum, to target repositories or indexes, such as Apache Solr, Open Search Server, or ElasticSearch. Apache ManifoldCF also defines a security model for target repositories that permits them to enforce source-repository security policies.

  • Apache Tika to perform the content extraction (PDF to text) so that the content of the documents is searchable in Solr later on.

    The Apache Tika™ toolkit detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF). All of these file types can be parsed through a single interface, making Tika useful for search engine indexing, content analysis, translation, and much more.
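To make the Solr side a bit more concrete, here is a rough Python sketch of what ends up in the index and how you would search it afterwards. ManifoldCF does the indexing for you once it is configured, so this is only an illustration and not what ManifoldCF does internally. Everything specific here is an assumption: the Solr URL, the collection name `documentum`, and the field names `id`, `title` and `content`. I assume `id` holds the Documentum object ID (`r_object_id`), so a search hit tells you which document to fetch from Documentum.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

# Assumed Solr location and collection name; adjust to your installation.
SOLR_BASE = "http://localhost:8983/solr/documentum"


def build_update_request(object_id, title, text):
    """Build a POST request for Solr's JSON update handler.

    `text` is the plain text already extracted from the PDF (e.g. by Tika).
    The field names are hypothetical and must match your Solr schema.
    """
    doc = {"id": object_id, "title": title, "content": text}
    body = json.dumps([doc]).encode("utf-8")
    return Request(
        SOLR_BASE + "/update?commit=true",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_search_url(terms, rows=10):
    """Build a query URL for Solr's /select handler.

    Only `id` and `title` are returned; the `id` is then used to fetch
    the original file from Documentum.
    """
    params = {
        "q": "content:(%s)" % terms,
        "fl": "id,title",
        "rows": rows,
        "wt": "json",
    }
    return SOLR_BASE + "/select?" + urlencode(params)


# Made-up object ID, for illustration only.
request = build_update_request("0900000000000001", "Invoice", "extracted text")
search_url = build_search_url("invoice")
```

Nothing is sent over the network here. Passing `request` or `search_url` to `urllib.request.urlopen` would actually talk to Solr, assuming it is running at that address.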

Upvotes: 1
