Reputation: 107
I recently sat down to write a little snippet of code that reads a .pdf file, extracts certain streams (just one in this case), uncompresses it and tries to output readable text, basically ASCII. From the stream's dictionary I know its filter is FlateDecode. According to the manual, this means the data is compressed with zlib. I found an example here on Stack Overflow suggesting gzuncompress to reverse this. So, this is my code snippet:
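$file = 'mypdf.pdf';
$data = fopen($file, "rb");
$size = filesize($file);
$contents = fread($data, $size);
fclose($data);
// irrelevant code finding a certain xx 0 obj and setting start_pos to it
$start_pos = strpos($contents, 'stream', $start_pos);
$end_pos = strpos($contents, 'endstream', $start_pos);
// Skip the 'stream' keyword (6 bytes) and the CRLF that follows it.
$start_pos = $start_pos + 8;
// Exclude the CRLF in front of 'endstream' (assumes CRLF line endings).
$end_pos = $end_pos - 2;
// substr() expects a length as its third argument, not an end offset.
$substring = substr($contents, $start_pos, $end_pos - $start_pos);
$result = gzuncompress($substring);
// Print the decompressed content stream rather than the raw compressed bytes.
echo $result;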
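The fixed offsets (+8 and -2) only fit files that use CRLF around the stream data. A slightly more defensive sketch of the extraction step could look like the following. The helper name extractStreamData is made up for illustration, and the sketch assumes that the first 'endstream' after the keyword really ends the stream (the /Length entry in the dictionary would be the more reliable source):

```php
<?php
// Hypothetical helper: returns the raw bytes between 'stream' and 'endstream',
// searching from $offset, or null if no complete stream is found.
function extractStreamData(string $pdf, int $offset = 0): ?string
{
    $start = strpos($pdf, 'stream', $offset);
    if ($start === false) {
        return null;
    }
    $start += strlen('stream');

    // The keyword is followed by either CRLF or a single LF.
    if (substr($pdf, $start, 2) === "\r\n") {
        $start += 2;
    } elseif (substr($pdf, $start, 1) === "\n") {
        $start += 1;
    }

    // Assumption: the first 'endstream' after the data ends this stream.
    $end = strpos($pdf, 'endstream', $start);
    if ($end === false) {
        return null;
    }
    $data = substr($pdf, $start, $end - $start);

    // Strip the single end-of-line marker that precedes 'endstream'.
    if (substr($data, -2) === "\r\n") {
        return substr($data, 0, -2);
    }
    if (substr($data, -1) === "\n" || substr($data, -1) === "\r") {
        return substr($data, 0, -1);
    }
    return $data;
}
```

With this in place, the decompression step would become gzuncompress(extractStreamData($contents, $start_pos)), after checking the return value for null.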
Until this point everything works as it should, I guess. The stream is found and its length matches the one given in its dictionary. gzuncompress works as well. At this point, however, I have no idea how to continue. The result looks something like this:
q 1 0 0 -1 0 841.889 cm q 1 0 0 1 70.866 28.346 cm 0 g /P <> BDC BT /F21 8 Tf 1 0 0 -1 0 19.17900085 Tm [<002800090016001000010005001000110001001A00120006000500130010000A00140009000A00140011001F>] TJ ET EMC /P <> BDC BT /F21 8 Tf 1 0 0 -1 0 28.77899933 Tm
And so on, with a lot of [<....>] and other stuff. I am clueless about how to continue from here, or whether it is even possible.
Thanks in advance
Upvotes: 2
Views: 174
Reputation: 107
As I got deeper into it, I noticed a few things about the structure of these PDFs. As they are basically all the same, I can exploit that a little. Only two fonts are present, F21 and F22. Both are defined before a certain object and apply from there until the end of the document. To start, I have to uncompress the streams and check whether a CIDMap is present in the decoded data. If so, I build the CMap from it. Then I can loop through the objects containing the data I want, translate the hex values to the corresponding UTF-8 characters via the CMap, and I should be done.
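A rough sketch of what the last two steps could look like in PHP follows. It rests on several assumptions: the decoded map is a standard ToUnicode CMap with beginbfchar/beginbfrange sections, every character code is two bytes (four hex digits, as in the sample output), destinations in ranges are single BMP characters, and the mbstring extension is available. The function names buildCMap and decodeTextOperators are made up for illustration, and the sketch handles one font's map at a time.

```php
<?php
// Parse the bfchar/bfrange sections of a decoded ToUnicode CMap into
// [character code (int) => UTF-8 string]. Sketch only: the array form of
// bfrange ("<a> <b> [<x> <y> ...]") is not handled.
function buildCMap(string $cmap): array
{
    $map = [];
    $hex = '<([0-9A-Fa-f]+)>';

    // beginbfchar ... endbfchar holds "<src> <dst>" pairs (dst is UTF-16BE).
    preg_match_all('/beginbfchar(.*?)endbfchar/s', $cmap, $sections);
    foreach ($sections[1] as $section) {
        preg_match_all("/$hex\\s*$hex/", $section, $pairs, PREG_SET_ORDER);
        foreach ($pairs as [, $src, $dst]) {
            $map[hexdec($src)] = mb_convert_encoding(hex2bin($dst), 'UTF-8', 'UTF-16BE');
        }
    }

    // beginbfrange ... endbfrange holds "<first> <last> <dstStart>" triples.
    preg_match_all('/beginbfrange(.*?)endbfrange/s', $cmap, $sections);
    foreach ($sections[1] as $section) {
        preg_match_all("/$hex\\s*$hex\\s*$hex/", $section, $triples, PREG_SET_ORDER);
        foreach ($triples as [, $first, $last, $dst]) {
            $firstCode = hexdec($first);
            $lastCode = hexdec($last);
            for ($code = $firstCode; $code <= $lastCode; $code++) {
                // Assumption: the destination is a single BMP code point.
                $codePoint = hexdec($dst) + ($code - $firstCode);
                $map[$code] = mb_convert_encoding(pack('n', $codePoint), 'UTF-8', 'UTF-16BE');
            }
        }
    }
    return $map;
}

// Decode the hex strings of every "[...] TJ" operator in a content stream.
// Returns one string per TJ operator; unmapped codes become '?'.
function decodeTextOperators(string $content, array $map, int $digits = 4): array
{
    $lines = [];
    preg_match_all('/\[(.*?)\]\s*TJ/s', $content, $arrays);
    foreach ($arrays[1] as $array) {
        // Numbers between the strings are kerning adjustments and are skipped.
        preg_match_all('/<([0-9A-Fa-f]+)>/', $array, $strings);
        $line = '';
        foreach ($strings[1] as $hexString) {
            foreach (str_split($hexString, $digits) as $chunk) {
                $line .= $map[hexdec($chunk)] ?? '?';
            }
        }
        $lines[] = $line;
    }
    return $lines;
}
```

Since F21 and F22 each have their own map, the content would need to be split at the /F21 and /F22 font switches (the Tf operator) and each part decoded with the matching map.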
Upvotes: 1