John Lawrence Aspden

Reputation: 17470

Is there a general solution for converting documents to plain text?

By documents, I mean Word, LibreOffice, etc., and maybe also PDFs and web pages.

In particular, for comparison purposes, it would be nice if the plain text appeared in the same order as it would to a reader of the printed document. It should also be stable: a trivial change, such as making a word boldface, shouldn't change the plain-text version.

Unixy answers preferred, but I'll take what I can get!

Upvotes: 0

Views: 88

Answers (2)

user8246956


I don't know whether there is an efficient and flexible general-purpose tool for different file formats (apart from libreoffice, already mentioned in another answer), but for those interested only in PDF, pdftotext is worth mentioning.

It converts PDF files to text very efficiently. In particular, for double-column pages you can choose either to replicate the original view (i.e. keep two columns in the text file) or to produce continuous single-column text.
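
As a rough sketch of the two modes described above: with the Poppler build of pdftotext, the -layout option keeps the physical page layout (so two columns stay side by side), while the default output follows reading order. The function name pdf_to_text and the PDFTOTEXT override variable are my own inventions for illustration, not part of the tool, and the exact output still depends on how the PDF was produced.

```shell
# Hypothetical wrapper: convert a PDF to text in one of two modes.
#   layout -> keep the original page layout (columns side by side)
#   flow   -> default reading-order output (continuous text)
# PDFTOTEXT is an assumed override variable, handy for dry runs.
pdf_to_text() {
  local mode=$1 src=$2 dst=$3
  case "$mode" in
    layout) "${PDFTOTEXT:-pdftotext}" -layout "$src" "$dst" ;;
    flow)   "${PDFTOTEXT:-pdftotext}" "$src" "$dst" ;;
    *)      echo "usage: pdf_to_text layout|flow in.pdf out.txt" >&2
            return 2 ;;
  esac
}
```

For example, pdf_to_text flow paper.pdf paper.txt should give the single-column version, which is probably the more useful one for diffing.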

Upvotes: 1

John Lawrence Aspden

Reputation: 17470

libreoffice does a good job on all the file types it can read:

libreoffice --headless --convert-to txt:Text name.doc

or (looping in bash):

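for i in *; do
  [ -f "$i" ] || continue   # skip directories and other non-regular files
  echo "$i"                 # show which file is being converted
  libreoffice --headless --convert-to txt:Text "$i"
done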
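
If the directory mixes PDFs with office documents, one possible approach is to pick the converter by file extension, using pdftotext for PDFs and libreoffice for everything else. This is only a sketch: pick_converter and to_text are hypothetical helper names, and sending every non-PDF file to libreoffice is an assumption that will not hold for formats it cannot read.

```shell
# Hypothetical helper: name the converter to use for a file, by extension.
pick_converter() {
  local ext
  # Lower-case the extension so that .PDF and .pdf are treated alike.
  ext=$(printf '%s' "${1##*.}" | tr '[:upper:]' '[:lower:]')
  case "$ext" in
    pdf) echo pdftotext ;;
    *)   echo libreoffice ;;   # assumption: libreoffice can read the rest
  esac
}

# Hypothetical helper: write a .txt file next to the original document.
to_text() {
  case "$(pick_converter "$1")" in
    pdftotext)
      pdftotext "$1" "${1%.*}.txt" ;;
    libreoffice)
      libreoffice --headless --convert-to txt:Text \
        --outdir "$(dirname "$1")" "$1" ;;
  esac
}
```

Calling to_text "$i" inside the loop above in place of the bare libreoffice command should then handle both kinds of file.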

Upvotes: 0
