ThunderBoy

Reputation: 471

Extracting specific data via coordinates using php pdfParser

I want to extract specific data from various PDFs that are 3-4 pages each. I don't want to parse all the text on each page and then use, for example, regular expressions to match the data I want.

So I was looking at the documentation, and the PHP pdfParser has this function, $data = $pdf->getPages()[0]->getDataTm();, which returns an array. The documentation says: "You can extract transformation matrix (indexes 0-3) and x,y position of text objects (indexes 4,5)." (https://github.com/smalot/pdfparser/blob/master/doc/Usage.md)

So I tried it, and it returns an array with all the data I want, plus the coordinates of each item.

Here is an example for you to try if you want.

require_once __DIR__ . '/vendor/autoload.php';
use Smalot\PdfParser\Parser;

$parser = new Parser();
$pdf = $parser->parseFile('pdfFile.pdf');

$data = $pdf->getPages()[0]->getDataTm();
print_r($data);

Now let's say I have the coordinates, but I don't know how to use them to find the exact data I want. I was looking in the documentation for a function that takes the coordinates, something like functionXYcoordinates("260", "120"), to get exactly what I want from my PDF, but I couldn't find anything.

If anyone knows whether pdfParser has a function like this, please let me know. Also feel free to say so if you believe that extracting data via coordinates is a bad idea and that it is better to parse all the pages and then use regular expressions to match the specific data.
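
The kind of lookup described above can be sketched by filtering the array that getDataTm() returns. This is a minimal sketch, not a pdfParser function: findTextAt is a hypothetical helper, the tolerance parameter is an assumption (positions rarely match to the last decimal), and the sample array is made-up data in the shape the documentation describes (matrix at index 0 with x,y at indexes 4,5, text at index 1).

```php
<?php
// Hypothetical helper (not part of pdfParser): return the text of every
// entry whose (x, y) position lies within $tolerance units of the target.
function findTextAt(array $dataTm, float $x, float $y, float $tolerance = 1.0): array
{
    $matches = [];
    foreach ($dataTm as $element) {
        $matrix = $element[0]; // transformation matrix; x,y at indexes 4 and 5
        $text = $element[1];   // the text found at that position
        if (abs((float) $matrix[4] - $x) <= $tolerance
            && abs((float) $matrix[5] - $y) <= $tolerance) {
            $matches[] = $text;
        }
    }
    return $matches;
}

// Made-up sample in the shape getDataTm() is documented to return
$sample = [
    [['1', '0', '0', '1', '260', '120'], 'Invoice 42'],
    [['1', '0', '0', '1', '72.5', '700'], 'Header'],
];

print_r(findTextAt($sample, 260, 120));
```

With a real document, $sample would be replaced by $pdf->getPages()[0]->getDataTm(). An empty array means nothing was found near the given position.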

Upvotes: 2

Views: 1597

Answers (1)

CH2

Reputation: 1

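You may need to convert the coordinates first. The sample below assumes that getDataTm returns positions in PDF points measured from the bottom-left corner of the page, and converts them to millimeters measured from the top-left corner.

Guide to converting points to mm: "The size of PDF documents, how do I convert from millimeters to pixels using Spire.pdf?"

Guide to using getDataTm: https://github.com/smalot/pdfparser/blob/master/doc/Usage.md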

Sample (in PHP):

$parser = new \Smalot\PdfParser\Parser();

$pdf = $parser->parseFile('document.pdf');
// ... or, to parse from a string instead:
// $pdf = $parser->parseContent(file_get_contents('document.pdf'));

// getPages() returns an array; take the page you want (loop over it to read multiple pages)
$pages = $pdf->getPages();
$page = $pages[0];

// get coordinate info of each string in the page
$dataTm = $page->getDataTm();

// Find keyword
$keyword = "testing";

// To get page height (assumes the page's details include a MediaBox entry)
$details = $page->getDetails();
$page_height = $details['MediaBox'][3];

$x=0;
$y=0;

// Match each string against the keyword (if several match, the last one wins)
foreach ($dataTm as $element) {
    $pos = strpos($element[1], $keyword);
    if ($pos !== false) {
        // Convert points to mm (1 pt = 25.4/72 mm) and measure y from the top of the page
        $x = $element[0][4] * 0.352777778;
        $y = ($page_height - $element[0][5]) * 0.352777778;
    }
}

print_r($x);
print_r($y);
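
The conversion used in the sample can also be pulled into a small function. This is a sketch under the same assumptions as the sample: positions are in PDF points (1/72 inch) with the origin at the bottom-left corner. The name pointsToTopLeftMm and the input values are hypothetical.

```php
<?php
// Hypothetical helper: convert a position in PDF points, measured from the
// bottom-left corner, to millimeters measured from the top-left corner.
function pointsToTopLeftMm(float $x, float $y, float $pageHeight): array
{
    $mmPerPoint = 25.4 / 72; // 1 pt = 1/72 inch, 1 inch = 25.4 mm
    return [
        $x * $mmPerPoint,
        ($pageHeight - $y) * $mmPerPoint,
    ];
}

// Made-up values: a point 72 pt from the left and 720 pt up on a 792 pt tall page
[$xMm, $yMm] = pointsToTopLeftMm(72.0, 720.0, 792.0);
printf("%.1f mm, %.1f mm\n", $xMm, $yMm);
```

The factor 25.4 / 72 is where the 0.352777778 in the sample comes from.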

Upvotes: 0
