Reputation: 17470
By documents, I mean Word, LibreOffice, etc., and maybe also PDFs and web pages.
In particular, for purposes of comparison, it would be nice if the plain text were in the same order as it would appear to a reader of the printed document, and if the plain text were stable, which is to say that a trivial change such as making a word boldface shouldn't change the plain-text version.
Unixy answers preferred, but I'll take what I can get!
Upvotes: 0
Views: 88
Reputation:
I don't know if there is an efficient and flexible general-purpose tool for different file formats (apart from libreoffice, already mentioned in another answer), but for those interested in PDF only, pdftotext is worth mentioning.
It converts PDF files to text very efficiently, in particular for double-column pages, where you can choose either to replicate the original view (i.e. keep two columns in the text file) or to have continuous single-column text.
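A minimal sketch of the two modes described above, assuming the Poppler (or Xpdf) build of pdftotext is installed. The -layout flag asks pdftotext to keep the physical page layout, so columns stay side by side; without it, the text comes out in reading order. The function name pdf_to_text and the mode names "columns" and "flow" are made up for illustration and are not part of pdftotext.

```shell
# pdf_to_text MODE INPUT.pdf OUTPUT.txt
# MODE is a hypothetical label chosen for this sketch:
#   columns  keep the physical layout (two columns stay two columns)
#   flow     undo the layout and emit continuous text in reading order
pdf_to_text() {
    mode=$1
    src=$2
    dst=$3
    case "$mode" in
        columns) pdftotext -layout "$src" "$dst" ;;
        flow)    pdftotext "$src" "$dst" ;;
        *)
            echo "usage: pdf_to_text columns|flow INPUT.pdf OUTPUT.txt" >&2
            return 2
            ;;
    esac
}
```

For comparing two versions of a document, something like `pdf_to_text flow paper.pdf paper.txt` is probably the better fit, since reading order is less sensitive to small layout shifts than the column-preserving output.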
Upvotes: 1
Reputation: 17470
libreoffice does a good job on every file type it can read:
libreoffice --headless --convert-to txt:Text name.doc
or (looping in bash):
for i in *; do
    echo "$i"
    libreoffice --headless --convert-to txt:Text "$i"
done
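Since the goal is comparison, here is a rough sketch that converts two documents with the same command and diffs the results. It assumes libreoffice is on the PATH and that the converter writes its output under the input's base name with the last extension replaced by .txt. The helper names txt_name and doc_diff are hypothetical, chosen only for this example.

```shell
# txt_name PATH: print the file name the converter is assumed to produce,
# i.e. the base name with its last extension replaced by .txt
txt_name() {
    base=$(basename "$1")
    printf '%s.txt\n' "${base%.*}"
}

# doc_diff OLD NEW: convert both documents to plain text, then diff them
doc_diff() {
    tmp=$(mktemp -d) || return 1
    # Separate output directories, so two inputs with the same name don't collide
    mkdir -p "$tmp/old" "$tmp/new"
    libreoffice --headless --convert-to txt:Text --outdir "$tmp/old" "$1" >/dev/null &&
    libreoffice --headless --convert-to txt:Text --outdir "$tmp/new" "$2" >/dev/null &&
    diff -u "$tmp/old/$(txt_name "$1")" "$tmp/new/$(txt_name "$2")"
    status=$?
    rm -rf "$tmp"
    return "$status"
}
```

Usage would look like `doc_diff draft-v1.odt draft-v2.odt`. If making a word bold really leaves the text export unchanged, the diff should come back empty for two versions that differ only in formatting.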
Upvotes: 0