gencay

Reputation: 609

Getting plain text from a document using C#

How can I extract only the plain text from a document, dropping all images, tables, and figures? I want to process these documents and build a word list from them, so I need just the text portion, using C#.

Upvotes: 3

Views: 240

Answers (2)

Andrew Lewis

Reputation: 5256

You probably need to look into IFilters. They're how most search indexers access plain text from documents on Windows. Here's a tutorial and sample project with source code you can use to extract text from Office documents and PDFs, etc.

You just need to make sure the correct IFilters are installed on your machine. Microsoft provides a free set of filters for Office documents. Adobe also provides one for PDFs, but it's complete garbage. If you can, try the Foxit IFilter; it's much, much better.

Upvotes: 1

BeemerGuy

Reputation: 8269

You have to support each document format specifically; there is no generic method that reads every document format.
For example, Microsoft Office Word files need to be interpreted by a library that understands the Word format, while OpenOffice document files need a different one.
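
To make the "one reader per format" point concrete, here is a minimal sketch for a single format, .docx. It relies on the fact that a .docx file is a ZIP archive whose body text normally lives in word/document.xml. The class name DocxText and the decision to skip paragraphs inside tables are my own assumptions for illustration, not something from the question, and the sketch ignores headers, footers, footnotes, and text boxes. On .NET Framework you would also need references to System.IO.Compression and System.IO.Compression.FileSystem.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Text;
using System.Xml.Linq;

// Hypothetical helper: pulls paragraph text out of a .docx file only.
// Other formats (.doc, .odt, .pdf, ...) would each need their own reader.
static class DocxText
{
    // WordprocessingML namespace used by elements in word/document.xml.
    static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    public static string Extract(string path)
    {
        using (ZipArchive zip = ZipFile.OpenRead(path))
        {
            // Assumption: the main part is at the conventional location.
            ZipArchiveEntry entry = zip.GetEntry("word/document.xml");
            if (entry == null)
                throw new InvalidDataException("word/document.xml not found.");

            using (Stream stream = entry.Open())
            {
                XDocument doc = XDocument.Load(stream);
                var sb = new StringBuilder();

                foreach (XElement p in doc.Descendants(W + "p"))
                {
                    // Skip paragraphs that sit inside a table (w:tbl).
                    if (p.Ancestors(W + "tbl").Any())
                        continue;

                    // Join the text runs (w:t) of this paragraph.
                    sb.AppendLine(string.Concat(
                        p.Descendants(W + "t").Select(t => t.Value)));
                }
                return sb.ToString();
            }
        }
    }
}
```

Calling DocxText.Extract on a file path returns one line per paragraph, which you could then split on whitespace and punctuation to build the word list. Images carry no w:t elements, so they drop out on their own.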

Upvotes: 0
