Reputation: 609
How can I extract the plain text from a document, ignoring any images, tables, and figures? I want to process these documents and build a word list from them, so I only need the text portion. I'm using C#.
Upvotes: 3
Views: 240
Reputation: 5256
You probably need to look into IFilters. They're how most search indexers access plain text from documents on Windows. Here's a tutorial and sample project with source code you can use to extract text from Office documents and PDFs, etc.
You just need to make sure you have the correct IFilters installed on your machine. Microsoft provides a free set of filters for Office documents. Adobe also provides a PDF filter, but it's complete garbage. If you can, try the Foxit IFilter; it's much, much better.
Upvotes: 1
Reputation: 8269
You have to support each document format specifically; there is no generic method that reads every document format.
For example, Microsoft Word documents have to be read with a library that understands the Word format, and OpenOffice documents need a different one.
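To illustrate the per-format idea, here is a rough sketch, not a complete solution. It assumes you only need .txt and .docx, and it relies on the fact that a .docx file is a ZIP archive whose main text lives in word/document.xml. The names PlainText, Extract, ExtractDocx and WordList are made up for this example. It skips paragraphs inside tables and never touches images, but it does not handle headers, footnotes or text boxes, and the old binary .doc format would need a different extractor. On .NET Framework you would also need a reference to System.IO.Compression.FileSystem for ZipFile.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;
using System.Xml.Linq;

public static class PlainText
{
    // WordprocessingML namespace used by word/document.xml inside a .docx.
    private static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Hypothetical dispatcher: one extractor per file extension.
    public static string Extract(string path)
    {
        switch (Path.GetExtension(path).ToLowerInvariant())
        {
            case ".txt": return File.ReadAllText(path);
            case ".docx": return ExtractDocx(path);
            default: throw new NotSupportedException("No extractor for " + path);
        }
    }

    // Reads the main document part and joins the text runs of each paragraph.
    public static string ExtractDocx(string path)
    {
        using (ZipArchive zip = ZipFile.OpenRead(path))
        {
            ZipArchiveEntry entry = zip.GetEntry("word/document.xml");
            if (entry == null)
                throw new InvalidDataException("word/document.xml not found");

            using (Stream stream = entry.Open())
            {
                XDocument doc = XDocument.Load(stream);
                var sb = new StringBuilder();
                foreach (XElement p in doc.Descendants(W + "p"))
                {
                    // Skip paragraphs that sit inside a table.
                    if (p.Ancestors(W + "tbl").Any()) continue;
                    foreach (XElement t in p.Descendants(W + "t"))
                        sb.Append(t.Value);
                    sb.AppendLine();
                }
                return sb.ToString();
            }
        }
    }

    // Builds a sorted, lower-cased list of distinct words (runs of letters).
    // Note that an apostrophe splits a word, so "don't" yields "don" and "t".
    public static SortedSet<string> WordList(string text)
    {
        var words = new SortedSet<string>(StringComparer.Ordinal);
        foreach (Match m in Regex.Matches(text, @"\p{L}+"))
            words.Add(m.Value.ToLowerInvariant());
        return words;
    }

    public static void Main(string[] args)
    {
        // Prints the word list of every file passed on the command line.
        foreach (string path in args)
            foreach (string word in WordList(Extract(path)))
                Console.WriteLine(word);
    }
}
```

Each new format would get its own case in Extract, which is where an IFilter-based or library-based extractor could be plugged in instead.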
Upvotes: 0