newbiedev

Reputation: 3586

pdf.js-extractor - pdf files aren't parsed correctly

I'm using pdf.js-extractor in a Node CLI script. I'm trying to extract a database of questions and answers from a PDF. After the file is processed, the output has this structure:

[
  '324',
  ' ',
  "Di quale dei seguenti arcipelaghi fa parte l'isola di ",
  'Delle isole Ponziane',
  ' ',
  'Delle isole Pelagie',
  ' ',
  'Delle isole Egadi',
  ' ',
  'Delle isole Eolie',
  ' ',
  'C',
  ' '
],
[ 'Favignana?', ' ' ],
[
  '325',
  ' ',
  'Di quale di queste città la cattedrale di Santa Maria ',
  'Napoli',
  ' ',
  'Firenze',
  ' ',
  'Roma',
  ' ',
  'Genova',
  ' ',
  'B',
  ' '
],
[ 'del Fiore è conosciuta semplicemente come il ' ],
[ 'Duomo?', ' ' ]

I've noticed that the PDF contents are split incorrectly: the answers and the correct-answer letter are listed correctly, but the question text is broken across several rows.

The expected format for each question is the following:

[
  '324',
  "Di quale dei seguenti arcipelaghi fa parte l'isola di Favignana?",
  'Delle isole Ponziane',
  'Delle isole Pelagie',
  'Delle isole Egadi',
  'Delle isole Eolie',
  'C', // correct answer letter
],
[
  '325',
  'Di quale di queste città la cattedrale di Santa Maria del Fiore è conosciuta semplicemente come il Duomo?',
  'Napoli',
  'Firenze',
  'Roma',
  'Genova',
  'B', // this is the correct answer letter
]

I'm processing the PDF with this code:

pdf.extract(pdfFile, {  
  firstPage: 2,
  normalizeWhitespace: true
}).then( (data) =>  {
  //console.log(data);
  spinner.stop();

  data.pages.forEach( (page) => {
    const lines = PdfExtract.utils.pageToLines(page, 1);
    const rows = PdfExtract.utils.extractTextRows(lines);
    fileContent.push(rows);
  });
  fileContent = fileContent.map( (row) => {
    return row.join('');
  });

  console.log(fileContent);

}).catch( (error) => console.log(error) );

How can I extract the PDF content correctly and solve this problem?

Upvotes: 0

Views: 345

Answers (1)

Murat Colyaran

Reputation: 2189

I believe the problem is related to asynchronous code.

I rewrote your code with async/await as follows. That might solve the problem, provided your PDF data is correct:

// Note: await is only valid inside an async function
// (or at the top level of an ES module).
const data = await pdf.extract(pdfFile, {
    firstPage: 2,
    normalizeWhitespace: true
});
await spinner.stop();
for (const page of data.pages) {
    const lines = await PdfExtract.utils.pageToLines(page, 1);
    const rows = await PdfExtract.utils.extractTextRows(lines);
    fileContent.push(rows);
}
fileContent = fileContent.map((row) => {
    return row.join('');
});
console.log(fileContent);
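
If the question text still comes out split across rows, one possible workaround is to merge the fragments in a post-processing step. The sketch below is not part of the library: `mergeQuestionRows` is a hypothetical helper, and it assumes that the rows look like the ones in the question, that every question row starts with a purely numeric ID, and that any other row is a continuation of the previous question's text.

```javascript
// Hypothetical helper (not a pdf.js-extract API): merges continuation rows
// into the question text of the preceding question row.
// Assumption: a question row starts with a numeric ID, followed by the
// question text, the answers, and the correct-answer letter.
function mergeQuestionRows(rows) {
  const questions = [];
  for (const row of rows) {
    // Drop the whitespace-only cells that appear between columns.
    const cells = row.map((cell) => cell.trim()).filter((cell) => cell !== '');
    if (cells.length === 0) continue;

    if (/^\d+$/.test(cells[0])) {
      // New question: [id, question text, answers..., correct letter]
      questions.push(cells);
    } else if (questions.length > 0) {
      // Continuation: append the fragment to the last question's text.
      const last = questions[questions.length - 1];
      const fragment = cells.join(' ');
      last[1] = last[1] ? `${last[1]} ${fragment}` : fragment;
    }
  }
  return questions;
}

// Usage with rows shaped like the ones in the question:
const sampleRows = [
  ['324', ' ', "Di quale dei seguenti arcipelaghi fa parte l'isola di ",
    'Delle isole Ponziane', ' ', 'Delle isole Pelagie', ' ',
    'Delle isole Egadi', ' ', 'Delle isole Eolie', ' ', 'C', ' '],
  ['Favignana?', ' '],
];
const merged = mergeQuestionRows(sampleRows);
console.log(merged);
```

This would misclassify a continuation row that happens to start with a number, so the ID check may need tightening for a real document.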

Upvotes: 2
