Reputation: 471
I want to extract specific data from various PDFs that are 3-4 pages each. I don't want to parse everything (all the text of each page) and then use, for example, regular expressions to match the data that I want.
So I was looking at the documentation, and the PHP PdfParser has this function: $data = $pdf->getPages()[0]->getDataTm();
It returns an array, and the documentation says that you can extract the transformation matrix (indexes 0-3) and the x,y position of text objects (indexes 4,5).
(https://github.com/smalot/pdfparser/blob/master/doc/Usage.md)
So I tried it, and it returns an array with all the data that I want, plus the coordinates of each item.
Here is an example for you to try if you want:
require_once __DIR__ . '/vendor/autoload.php';
use Smalot\PdfParser\Parser;
$parser = new Parser();
$pdf = $parser->parseFile('pdfFile.pdf');
$data = $pdf->getPages()[0]->getDataTm();
print_r($data);
Now let's say I have the coordinates, but I don't know how to use them to find the exact data that I want.
I was looking through the documentation for a function that takes coordinates, something like functionXYcoordinates("260", "120"),
in order to get exactly what I want from my PDF, but I couldn't find anything.
If anyone knows whether there is a function like this in PdfParser, please let me know. Also feel free to say so if you believe that extracting data via coordinates is a bad idea and that it is better to parse all the pages and then use regular expressions to match the specific data.
Upvotes: 2
Views: 1597
Reputation: 1
You may need to convert the coordinates before using them as X/Y positions in the PDF. getDataTm returns positions in points (1 pt = 1/72 inch, about 0.3528 mm).
Guide to converting from points to mm: "The size of PDF documents, how do I convert from millimeters to pixels using Spire.pdf?"
Guide to using getDataTm: https://github.com/smalot/pdfparser/blob/master/doc/Usage.md
Sample (in PHP):
$parser = new \Smalot\PdfParser\Parser();
$pdf = $parser->parseFile('document.pdf');
// ... or ...
// $pdf = $parser->parseContent(file_get_contents('document.pdf'));
// Get the page you want (the first page here); loop over getPages() to read multiple pages.
$page = $pdf->getPages()[0];
// Get the coordinate info of each string on the page
$dataTm = $page->getDataTm();
// Keyword to look for
$keyword = "testing";
// Page height in points (assumes the page details contain a MediaBox entry)
$details = $page->getDetails();
$page_height = $details['MediaBox'][3];
$x = 0;
$y = 0;
// Match each string against the keyword
foreach ($dataTm as $element) {
    // $element[0] is the transformation matrix, $element[1] is the text
    if (strpos($element[1], $keyword) !== false) {
        // Convert points to mm and flip y so it is measured from the top of the page
        $x = (float) $element[0][4] * 0.352777778;
        $y = ($page_height - (float) $element[0][5]) * 0.352777778;
    }
}
print_r($x);
print_r($y);
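As far as I can tell, the library's usage documentation has no built-in lookup by coordinates, so going the other way (from a known position to the text) can be done by filtering the getDataTm array yourself. The following is only a minimal sketch: findTextNear and its $tolerance parameter are hypothetical names, not part of pdfParser, and the sample array is made-up data in the shape getDataTm returns (matrix at index 0, text at index 1). Coordinates are assumed to be in points, as returned by getDataTm.

```php
<?php
// Hypothetical helper (not part of smalot/pdfparser): returns the text of every
// item whose x,y origin lies within $tolerance points of the requested position.
function findTextNear(array $dataTm, float $x, float $y, float $tolerance = 2.0): array
{
    $matches = [];
    foreach ($dataTm as $element) {
        // $element[0] is the transformation matrix; indexes 4 and 5 hold x and y.
        $dx = abs((float) $element[0][4] - $x);
        $dy = abs((float) $element[0][5] - $y);
        if ($dx <= $tolerance && $dy <= $tolerance) {
            $matches[] = $element[1];
        }
    }
    return $matches;
}

// Made-up sample in the shape of getDataTm() output; with a real file you would
// pass $pdf->getPages()[0]->getDataTm() instead.
$sample = [
    [['1', '0', '0', '1', '260.0', '120.0'], 'Invoice 42'],
    [['1', '0', '0', '1', '72.0', '700.5'], 'Header'],
];
print_r(findTextNear($sample, 260, 120));
```

A small tolerance helps because positions can differ slightly between PDFs produced from the same template; if the layout shifts a lot from file to file, matching on a keyword or a regular expression is probably more robust than fixed coordinates.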
Upvotes: 0