Red_Developper

Reputation: 53

How to extract the contents of a PDF file into string variables

I want to know how to write a Perl script that extracts the contents of a PDF and inserts them into a database.

Example: I have a PDF file (see the example below: MyPdfFile). From this file I want to extract the item codes (A and B), quantities (3 and 2) and prices (10 and 20) and insert them into a database (table: ORDERS).

MyPdfFile

Thanks in advance for your help.

Upvotes: 1

Views: 251

Answers (2)

brianvolk

Reputation: 1

I had to add the '-' after $pdf_file to capture the pdftotext output into $output_of_pdftotext:

my $output_of_pdftotext = `pdftotext $pdf_file -`;

Usage: pdftotext [options] [PDF-file] [text-file]

Upvotes: 0

thb

Reputation: 14424

After a brief search, I see no existing Perl module that does exactly what you want with minimal fuss. However, on an open-source platform, Poppler provides the pdftotext utility. Nothing prevents Perl from invoking the pdftotext binary via

my $output_of_pdftotext = `pdftotext $pdf_file`;

or

my @output_of_pdftotext = `pdftotext $pdf_file`;
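A slightly more defensive way to make the same call is sketched below. It passes the arguments as a list, so a file name containing spaces or shell metacharacters is not interpreted by the shell, and it ends with "-", because without a text-file argument pdftotext writes to a .txt file next to the PDF instead of to standard output. The helper names pdftotext_command and read_pdf_text are made up for this illustration, and the choice of -layout is an assumption that may or may not suit your PDF.

```perl
use strict;
use warnings;

# Build the argument list for pdftotext. "-layout" and "-enc" are
# pdftotext options; the trailing "-" sends the text to standard output.
sub pdftotext_command {
    my ($pdf_file) = @_;
    return ('pdftotext', '-layout', '-enc', 'UTF-8', $pdf_file, '-');
}

# Run pdftotext without going through the shell and return its output
# as a list of decoded lines. (Hypothetical helper, not a module API.)
sub read_pdf_text {
    my ($pdf_file) = @_;
    open(my $fh, '-|', pdftotext_command($pdf_file))
        or die "Cannot run pdftotext: $!";
    binmode($fh, ':encoding(UTF-8)');
    my @lines = <$fh>;
    close($fh) or die "pdftotext failed on $pdf_file (exit status $?)";
    return @lines;
}

# Usage, assuming a file named order.pdf exists:
# my @lines = read_pdf_text('order.pdf');
```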

If you do not mean to generalize your solution but just need something to solve your immediate problem (which, I assume, is your present orientation, insofar as you are using Perl, which excels at such usage), then my practical suggestion would be that you install Poppler's pdftotext utility, try it manually on your PDF, and see what it outputs. Then, given some minimal fluency in Perl, you can have your Perl script pattern-match the output and reformat it as you like.
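As a rough sketch of that pattern-matching step, the code below assumes each item appears on its own line as an item code, a quantity and a price separated by whitespace. That layout is a guess, since the real pdftotext output for your file is not shown here, so the regular expression will almost certainly need adjusting. The sub names and the column names item_code, quantity and price are hypothetical. Only the table name ORDERS comes from the question.

```perl
use strict;
use warnings;

# Assumed layout of one item line in the pdftotext output:
#   <item code> <quantity> <price>, for example "A    3    10"
# Lines that do not fit (headers, blank lines) are skipped.
sub parse_order_lines {
    my (@lines) = @_;
    my @orders;
    for my $line (@lines) {
        if ($line =~ /^\s*([A-Za-z]\w*)\s+(\d+)\s+(\d+(?:[.,]\d+)?)\s*$/) {
            my ($code, $qty, $price) = ($1, $2, $3);
            $price =~ tr/,/./;    # accept a decimal comma
            push @orders, { code => $code, quantity => $qty, price => $price };
        }
    }
    return @orders;
}

# Insert parsed rows through an already connected DBI handle.
# The column names are assumptions; adapt them to your ORDERS table.
sub insert_orders {
    my ($dbh, @orders) = @_;
    my $sth = $dbh->prepare(
        'INSERT INTO ORDERS (item_code, quantity, price) VALUES (?, ?, ?)'
    );
    $sth->execute(@{$_}{qw(code quantity price)}) for @orders;
    return scalar @orders;
}

# Sample lines standing in for real pdftotext output.
my @sample = (
    "Item  Quantity  Price\n",
    "A     3         10\n",
    "B     2         20\n",
);
for my $order (parse_order_lines(@sample)) {
    print "$order->{code} $order->{quantity} $order->{price}\n";
}
```

To store the rows, you would connect with DBI->connect using the driver for your database and pass the handle to insert_orders. The placeholders (?) let the driver handle quoting.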

CHARACTER ENCODINGS

Following up, OP asks:

[T]o extract the contents of the PDF to stdout, Poppler works great, but I have a small problem with the display of some words containing accents. Example: désignation (in the PDF) = DÃ©signation in the standard output?

The utf-8 character encoding encodes "é" with the two bytes C3 A9 (hexadecimal). The iso-8859-1 encoding reads those same two bytes as the two characters "Ã©" (it encodes "é" itself as the single byte E9). Your "désignation" is evidently encoded as utf-8, which is normal, so your standard output is right. However, apparently, your terminal wants to display iso-8859-1. If so, then your terminal is misinterpreting the standard output.
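The byte-level picture can be checked from Perl itself with the core Encode module. This is only a small demonstration, and hex_bytes is a helper name invented for it.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hex dump of a byte string, with the bytes separated by spaces.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf('%02X', ord($_)) } split //, $bytes;
}

my $e_acute = "\x{E9}";                         # the character U+00E9
my $utf8    = encode('UTF-8', $e_acute);        # two bytes
my $latin1  = encode('ISO-8859-1', $e_acute);   # one byte

# What an iso-8859-1 terminal makes of the utf-8 bytes: two characters.
my $misread = decode('ISO-8859-1', $utf8);

binmode(STDOUT, ':encoding(UTF-8)');
print 'UTF-8 bytes:      ', hex_bytes($utf8),   "\n";   # C3 A9
print 'ISO-8859-1 bytes: ', hex_bytes($latin1), "\n";   # E9
print 'Misread as:       ', $misread,           "\n";   # U+00C3 U+00A9
```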

You could tell pdftotext to use iso-8859-1 (I leave it to you as an exercise to read the man page and figure out how to do this). However, my recommendation would be that you instead set your terminal to display utf-8.
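If changing the terminal is not an option, a third possibility is to convert the text inside the Perl script before printing it. The sketch below assumes the captured bytes are utf-8. The helper name recode_for_terminal and the sample string are made up for this illustration.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Convert utf-8 bytes (assumed to be what pdftotext emitted) into the
# byte encoding a terminal expects. Hypothetical helper name.
sub recode_for_terminal {
    my ($utf8_bytes, $terminal_encoding) = @_;
    my $text = decode('UTF-8', $utf8_bytes);     # bytes -> characters
    return encode($terminal_encoding, $text);    # characters -> bytes
}

# Sample bytes standing in for captured pdftotext output.
my $from_pdftotext = "d\xC3\xA9signation\n";
print recode_for_terminal($from_pdftotext, 'ISO-8859-1');
```

Decoding to characters first also keeps later pattern matching reliable, because "é" is then one character rather than two bytes.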

How to set your terminal to display utf-8? This depends on which terminal you are using. I do not know your terminal. On my terminal, changing the encoding is easy. Perhaps a few minutes of exploration and experimentation with your terminal's preferences and settings will show you how to change to utf-8.

Upvotes: 2
