How to extract content of doc/docx using fs api of node.js

Question

The following works well for extracting the content of doc/docx files. My intention is to extract only the text, not the images. If the code is given a document that contains images, it cannot process it properly and prints an enormous amount of text that is unreadable to a human. Is there any way for the fs module to skip the images and extract only the text?

var fs = require("fs");
fs.readFile("Protractor.docx", 'utf8', function (err,data) {
    if (err) {
      return console.log(err);
    }
    console.log(data);
});

Emad Dehnavi · Accepted Answer

You can use the mammoth library, which has an extractRawText method. It extracts only the text and ignores images and all formatting.

Here is an example that extracts the text from a docx file containing images:

const superagent = require('superagent');
const mammoth = require('mammoth');

const url = 'http://www.ojk.ee/sites/default/files/respondus-docx-sample-file_0.docx';

const main = async () => {
  // Download the file as binary data so the body arrives as a Buffer
  const response = await superagent.get(url)
    .parse(superagent.parse.image)
    .buffer();

  const buffer = response.body;

  // extractRawText resolves to an object whose `value` holds the plain text
  const text = (await mammoth.extractRawText({ buffer })).value;
  const lines = text.split('\n');

  console.log(lines);
};

main().catch(error => console.error(error));
