Reputation: 7926
I'm building an information retrieval system that searches text across multiple file formats. I tried the EPocalipse IFilter library, but it throws an exception when reading .docx files. I also tried the Toxy library, but it throws an exception on Arabic .doc files. Finally, I tried the TikaOnDotNet library, but it needs Java to work, and I need to deploy the system to a hosting server that does not have Java installed.
Upvotes: 5
Views: 1986
Reputation: 3750
The Apache Tika library can extract text from a very wide range of file types. It can also extract metadata (if any) from non-text files such as images and videos. Example use cases are shown here.
Upvotes: 2
Reputation:
What about using a separate library for each format:
For DOC/DOCX: http://www.dotnetperls.com/word
For PDF: https://github.com/itext/itextsharp
For TXT: https://msdn.microsoft.com/en-us/library/ms143368(v=vs.110).aspx
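To illustrate the one-extractor-per-format idea, here is a minimal sketch of dispatching on the file extension. It is written in Python only to keep it short and dependency-free; it is not taken from any of the libraries above, and the names `docx_text`, `plain_text`, `EXTRACTORS` and `extract_text` are made up for this example. It relies on the fact that a .docx file is a ZIP archive whose main body lives in `word/document.xml`, so it needs neither Java nor Office on the server. Legacy .doc and PDF are not handled here and would still need a dedicated library.

```python
import io
import zipfile
import xml.etree.ElementTree as ET
from pathlib import Path

# WordprocessingML namespace used by the elements in word/document.xml.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_text(data: bytes) -> str:
    """Return the paragraph text of a .docx file given its raw bytes."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph is split into runs; join the text of every <w:t> in it.
        paragraphs.append("".join(t.text or "" for t in para.iter(W_NS + "t")))
    return "\n".join(paragraphs)


def plain_text(data: bytes) -> str:
    """Decode a text file, assuming UTF-8 (an assumption of this sketch)."""
    return data.decode("utf-8", errors="replace")


# Hypothetical registry: one extractor per supported extension.
EXTRACTORS = {".docx": docx_text, ".txt": plain_text}


def extract_text(filename: str, data: bytes) -> str:
    """Pick an extractor by file extension and run it on the file bytes."""
    suffix = Path(filename).suffix.lower()
    if suffix not in EXTRACTORS:
        raise ValueError(f"no extractor registered for {suffix!r}")
    return EXTRACTORS[suffix](data)
```

Adding a format then means registering one more function in `EXTRACTORS`, and the indexing code only ever calls `extract_text`. In .NET the same approach should be possible with `System.IO.Compression.ZipArchive` and `System.Xml.Linq`.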
Upvotes: 2