Reputation: 5997
The following code works well for reading the content of a doc/docx file. My intention is to extract only the text, not the images. If the code is given a document that contains images, it cannot process it properly and prints an enormous amount of text that no human can read. Is there any way for the fs module to skip the images and extract only the text?
var fs = require("fs");
fs.readFile("Protractor.docx", 'utf8', function (err, data) {
  if (err) {
    return console.log(err);
  }
  console.log(data);
});
Upvotes: 0
Views: 4847
Reputation: 3451
You can use the mammoth library, which has an extractRawText method. It extracts only the text and ignores images and all formatting.
Here is an example that extracts the text from a docx file containing images:
const superagent = require('superagent');
const mammoth = require('mammoth');
const url = 'http://www.ojk.ee/sites/default/files/respondus-docx-sample-file_0.docx';
const main = async () => {
  const response = await superagent.get(url)
    .parse(superagent.parse.image)
    .buffer();
  const buffer = response.body;
  const text = (await mammoth.extractRawText({ buffer })).value;
  const lines = text.split('\n');
  console.log(lines);
};
main().catch(error => console.error(error));
Upvotes: 1