Prem

Reputation: 5997

How to extract content of doc/docx using fs api of node.js

The following code reads the content of a doc/docx file. My intention is to extract only the text, not the images. If the code is fed a document that contains images, it cannot process it properly and prints an enormous amount of text that is not human-readable. Is there any way for the fs module to skip the images and extract only the text?

var fs = require("fs");
fs.readFile("Protractor.docx", 'utf8', function (err,data) {
    if (err) {
      return console.log(err);
    }
    console.log(data);
});

Upvotes: 0

Views: 4847

Answers (1)

Emad Dehnavi

Reputation: 3451

You can use the mammoth library, which has an extractRawText method. It extracts only the text and ignores images and all formatting.

This is an example that extracts the text from a docx file containing images:

const superagent = require('superagent');
const mammoth = require('mammoth');

const url = 'http://www.ojk.ee/sites/default/files/respondus-docx-sample-file_0.docx';

const main = async () => {
  // Download the file as raw bytes; the image parser keeps the body as a Buffer
  const response = await superagent.get(url)
    .parse(superagent.parse.image)
    .buffer();

  const buffer = response.body;

  // extractRawText resolves to an object whose `value` holds the plain text
  const text = (await mammoth.extractRawText({ buffer })).value;
  const lines = text.split('\n');

  console.log(lines);
};

main().catch(error => console.error(error));
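
As background on why the fs-only approach in the question prints unreadable text: a .docx file is a ZIP archive of XML parts and embedded media, so decoding its bytes as 'utf8' produces compressed binary data rather than the document text. The sketch below uses only fs and checks for the ZIP signature at the start of a file. The helper names looksLikeZip and isProbablyDocx are hypothetical and belong to neither fs nor mammoth. Note also that a legacy .doc file is a different binary format, and mammoth is meant for .docx only.

```javascript
const fs = require('fs');

// ZIP local file header signature: the bytes "PK\x03\x04".
const ZIP_SIGNATURE = Buffer.from([0x50, 0x4b, 0x03, 0x04]);

// Hypothetical helper: true if the buffer starts with the ZIP signature.
function looksLikeZip(buffer) {
  return buffer.length >= 4 && buffer.subarray(0, 4).equals(ZIP_SIGNATURE);
}

// Hypothetical helper: reads the file as raw bytes (no encoding argument,
// so fs returns a Buffer instead of a decoded string) and checks the header.
function isProbablyDocx(filePath) {
  const buffer = fs.readFileSync(filePath);
  return looksLikeZip(buffer);
}
```

A call such as isProbablyDocx("Protractor.docx") only indicates that the file is a ZIP container. It does not validate the document, so it is a quick sanity check before handing the Buffer to a parser such as mammoth, not a replacement for one.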

Upvotes: 1
