Piotr Sarnacki

Reputation:

Programmatically get the page count of Microsoft Word documents on Linux

I need to get the page count of Word documents. I've tested many libraries and scripts (Apache POI, Perl scripts, some Linux applications, and more), and the only working solution was to install Microsoft Office under Wine and access it through OLE from Perl. I got that working, but it seems I can't use it on a server because of licensing problems...

The problem with Apache POI and other solutions that read Word document metadata is that some documents are incomplete: the pageCount property in the document summary is sometimes missing (this is often the case with ODT documents saved as .doc, and with older documents).

Is there any way to actually count the pages (not just read the value from the summary) without installing Microsoft Office on the server?

Upvotes: 2

Views: 4064

Answers (3)

Tai Paul

Reputation: 920

This is a version that also gets the page count from the document summary. I've added it late because MS Word has been through a number of updates since the question was asked.

The environment in which the following works is:

  • GNU bash, version 5.1.0(1)-release (x86_64-redhat-linux-gnu)
  • MS Word version 12 (2007) and version 16 (2016 - 2021)

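It does not work for MS Word 9 (2000), and I assume not for earlier versions either, presumably because those versions save the binary .doc format rather than the ZIP-based .docx package that unzip reads. I also haven't tested the code in other shells.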


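DOCUMENT="YourDocument.docx"   # replace with the path to your .docx file
# docProps/app.xml holds the extended properties; keep only the line with the XML body
PROPS=$(unzip -c "$DOCUMENT" docProps/app.xml | tail -1)
# -n ... p prints the captured number only when a <Pages> element is present
NUMBER_OF_PAGES=$(sed -n 's/.*<Pages>\([0-9]*\)<\/Pages>.*/\1/p' <<< "$PROPS")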

The PROPS variable is used so that you might get the number of lines or words without reading the entire file again.


NUMBER_OF_LINES=$(sed -n 's/.*<Lines>\([0-9]*\)<\/Lines>.*/\1/p' <<< "$PROPS")
NUMBER_OF_WORDS=$(sed -n 's/.*<Words>\([0-9]*\)<\/Words>.*/\1/p' <<< "$PROPS")

You can view the other properties with:

echo "$PROPS"
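
The same extraction can be wrapped in a small helper so that each property is read with one call. This is only a sketch and not part of the original answer: the function names app_xml_prop and docx_prop are made up for illustration, and it assumes unzip and sed are installed and that the file is a .docx whose docProps/app.xml contains the requested element. It uses unzip -p, which writes only the file data to stdout, so no tail is needed.

```shell
# Hypothetical helpers (names are illustrative, not from any library).

# Read the content of app.xml on stdin and print the numeric value of the
# given element. Prints nothing if the element is missing.
app_xml_prop() {
    local tag=$1
    sed -n "s|.*<${tag}>\([0-9][0-9]*\)</${tag}>.*|\1|p" | head -n 1
}

# Print one property (Pages, Words, Lines, ...) of a .docx file.
docx_prop() {
    local file=$1 tag=$2
    unzip -p "$file" docProps/app.xml | app_xml_prop "$tag"
}
```

For example, docx_prop YourDocument.docx Pages prints the stored page count. Like the commands above, this only reads the value Word saved in the summary; it does not lay out the document.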

Upvotes: 0

PRMan

Reputation: 595

If you trust the document summary, then instead of using wvSummary you can just open the file and run a regex search for "nofpages(\d+)". The first capture group (Groups[1]) will contain the number of pages.

Since Word always saves the summary when it saves, I think this is pretty safe if you know the document was last saved with Word, which in my experience is 99% of the time.
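
A rough shell version of that search might look like the following. It is a sketch under assumptions: the function name summary_page_count is hypothetical, and it only finds something when the file really contains the text nofpages followed by digits, as RTF files do in their info group (the \nofpages control word).

```shell
# Hypothetical helper: print the first number that follows "nofpages" in a file.
# -a makes grep treat binary data as text, -o prints only the matched part.
# Prints nothing if the marker is not found.
summary_page_count() {
    grep -a -o 'nofpages[0-9][0-9]*' "$1" | head -n 1 | sed 's/nofpages//'
}
```

Calling summary_page_count YourDocument.rtf prints the stored count. An empty result means the summary value is missing and another method is needed.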

Upvotes: 1

Matthew Flaschen

Reputation: 284836

I was going to say wvSummary, but I think this uses the metadata you're referring to. I'm not sure there is a way to get the page count without actually laying out the document. So you might have to resort to using APIs to drive a real Office-compatible application like OpenOffice or AbiWord.
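
One way to have the document laid out without Microsoft Office is to let a headless office suite render it to PDF and then count the PDF pages. The sketch below makes several assumptions: it swaps in LibreOffice (an OpenOffice descendant) rather than OpenOffice or AbiWord, it expects the soffice binary and pdfinfo from poppler-utils to be installed, and the function names are hypothetical. LibreOffice's layout is not identical to Word's, so the result may differ from what Word itself would report.

```shell
# Hypothetical helpers; assumes LibreOffice (soffice) and poppler-utils (pdfinfo).

# Read `pdfinfo` output on stdin and print the value of its "Pages:" line.
pages_from_pdfinfo() {
    awk '/^Pages:/ { print $2; exit }'
}

# Render a document to PDF in a temporary directory, then count the PDF pages.
# Prints nothing if the conversion did not produce a PDF.
layout_page_count() {
    local doc=$1 tmpdir pdf
    tmpdir=$(mktemp -d) || return 1
    soffice --headless --convert-to pdf --outdir "$tmpdir" "$doc" >/dev/null 2>&1
    # LibreOffice names the output after the input file, with a .pdf extension.
    pdf="$tmpdir/$(basename "${doc%.*}").pdf"
    if [ -f "$pdf" ]; then
        pdfinfo "$pdf" | pages_from_pdfinfo
    fi
    rm -rf "$tmpdir"
}
```

For example, layout_page_count old-report.doc prints the page count of the rendered PDF. This is much slower than reading the summary, so it may make sense to use it only as a fallback when the stored value is missing.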

Upvotes: 2
