Mira

Reputation: 195

Is there any way to convert part of a PDF byte array to a separate PDF file?

If I have only part of the byte array of a PDF file (e.g. the whole file is 10 MB and I have only the first 5 MB), is there any way to save that part as a separate PDF file? Preferably using C#, but any other programming language will be OK.

Upvotes: 2

Views: 1408

Answers (2)

David van Driessche

Reputation: 7056

PDF files are built out of objects, so they are modular and support random access. Arguably the most important part of the whole PDF file comes at the end of the file: the cross-reference (XREF) table, which provides byte offsets to all of those objects.

Not having the last part of the file means that the XREF table isn't present, which is unfortunate to say the least. You might be able to rebuild part of the XREF table (some PDF viewers are capable of doing that), but if you are missing half the file, the chance that this will work is slim to non-existent.
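As a quick check of whether the end of the file (and with it the pointer to the XREF table) made it into your byte array, something like the following rough sketch could be used. The function name `find_startxref` and the 1024-byte tail window are my own choices for illustration; the sketch only looks for the `startxref` keyword followed by an offset and `%%EOF` near the end of the data.

```python
import re
from typing import Optional

# "startxref", the byte offset of the XREF table, then the end-of-file marker.
TAIL_PATTERN = re.compile(rb"startxref\s+(\d+)\s+%%EOF")

def find_startxref(data: bytes, window: int = 1024) -> Optional[int]:
    """Return the XREF offset announced near the end of the data, or None.

    Only the last `window` bytes are inspected (an assumption of this sketch),
    so a truncated download will normally yield None.
    """
    tail = data[-window:]
    matches = list(TAIL_PATTERN.finditer(tail))
    if not matches:
        return None
    # If several trailers are visible, the last one is the most recent.
    return int(matches[-1].group(1))
```

If this returns `None` for your 5 MB, you are in the situation described above: the XREF table is in the part you don't have.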

The only possibility left is that the PDF was saved "linearised". Such PDF files have all objects for the first page at the very beginning of the file, along with a smaller XREF table, also near the beginning, that indexes only the objects needed to display the first page. This was done to make a PDF file quicker to display while it is still being downloaded, from a web site for example. In your case, if the PDF was created that way, it might give you an angle to rebuild at least the first page...
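To see whether your partial data comes from a linearised file, you could look for the `/Linearized` key near the start, since a linearised PDF carries its linearisation dictionary as the first object. This is only a heuristic sketch; the function name `looks_linearized` and the 1024-byte window are assumptions of mine.

```python
import re

def looks_linearized(data: bytes, window: int = 1024) -> bool:
    """Heuristic check for a linearised ("fast web view") PDF.

    Looks for the PDF header and a /Linearized key within the first
    `window` bytes. It does not validate the linearisation dictionary.
    """
    head = data[:window]
    has_header = b"%PDF-" in head
    has_key = re.search(rb"/Linearized\b", head) is not None
    return has_header and has_key
```

A `True` result only tells you the file claims to be linearised; whether the first page can actually be rebuilt still depends on all of its objects being in the bytes you have.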

PDF Forensics
Let me just add these additional thoughts, which are perhaps a bit extreme (but it all depends on how desperately you want to recover content from such PDF files, of course).

As I said already, PDF files are basically a collection of objects. Each of these objects is clearly delimited (its beginning and end are recognisable if you implement a correct PDF parser).

This means you can start reading at the beginning of the PDF file and build a table of objects. Each object starts with its ID (an object number and a generation number, followed by the keyword "obj"), so you can store the ID and the corresponding file offset for each object you find. You could continue until you run out of file. In your case, with only half the file downloaded, you would end up with roughly half of the objects.
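The scan described above could be sketched roughly like this. It is a regex-based approximation, not a correct PDF parser: binary stream data can contain byte sequences that look like object markers, and objects packed inside object streams (PDF 1.5 and later) are not found at all. The name `scan_objects` is hypothetical.

```python
import re
from typing import Dict, Tuple

# "<object number> <generation> obj" marks the start of an indirect object.
OBJ_START = re.compile(rb"(?<![0-9])(\d+)\s+(\d+)\s+obj\b")

def scan_objects(data: bytes) -> Dict[Tuple[int, int], Tuple[int, int]]:
    """Map (object number, generation) -> (start offset, end offset).

    Only objects whose closing "endobj" is present are recorded, so an
    object cut off by the truncation is skipped. If an ID occurs more than
    once, the occurrence later in the file overwrites the earlier one.
    """
    table = {}
    for match in OBJ_START.finditer(data):
        end = data.find(b"endobj", match.end())
        if end == -1:
            break  # ran out of file in the middle of an object
        key = (int(match.group(1)), int(match.group(2)))
        table[key] = (match.start(), end + len(b"endobj"))
    return table
```

Slicing `data[start:end]` with the stored offsets gives you the raw bytes of each complete object, which is the input for the page search described next.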

The next trick would be to scan over all objects and try to find "Page" objects. These are recognisable because they have to be dictionaries and have to contain a key called "Type" with "Page" as its value. For each such page object you could then check whether all the objects that particular page needs are already there and, if they are, save the page to a new PDF document.
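As a rough illustration of that step, assuming you already have the raw bytes of each salvaged object keyed by (object number, generation): the sketch below picks out page objects and lists the directly referenced objects that are missing. The names `find_page_objects` and `missing_references` are made up for this example, only direct references are checked (not references of references), and the regexes can be fooled by stream data.

```python
import re
from typing import Dict, Iterable, List, Set, Tuple

ObjId = Tuple[int, int]

# /Type /Page, but not /Type /Pages (the page tree node).
PAGE_TYPE = re.compile(rb"/Type\s*/Page(?![A-Za-z0-9])")
# Indirect reference: "<object number> <generation> R".
REFERENCE = re.compile(rb"(?<![0-9])(\d+)\s+(\d+)\s+R\b")

def find_page_objects(objects: Dict[ObjId, bytes]) -> List[ObjId]:
    """Return the IDs of objects whose dictionary has /Type /Page."""
    return [key for key, body in objects.items() if PAGE_TYPE.search(body)]

def missing_references(body: bytes, known: Iterable[ObjId]) -> Set[ObjId]:
    """Return directly referenced object IDs that were not salvaged."""
    known_ids = set(known)
    referenced = {(int(num), int(gen)) for num, gen in REFERENCE.findall(body)}
    return referenced - known_ids
```

A page whose `missing_references` result is non-empty (ignoring `/Parent`, which you would replace with a new page tree anyway) is probably not recoverable as is.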

However...

There be dragons... Keep in mind these subtleties (and I probably forgot a bunch):

  • A page object need not have an index identifying its page number. Usually you would search for the "Pages" object, and from there the position of a "Page" object in the page tree would determine its page index. If you only look at "Page" objects you may have a hard time identifying which is the first page, second page, etc. You would probably have to assume that the first page is the first "Page" object in the file, but that would only be an (educated) guess.
  • Without the end of the file, there is no way to tell whether the PDF file had at some point been edited and incrementally saved. When PDF files are saved incrementally, the modified objects are not removed from the document; new versions are simply added to the end of the file. If that happened, the objects you salvage from the PDF file might not be the latest version of the truth.
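For the second point, one partial check is possible on the bytes you do have: if the same object number starts more than once, that suggests an incremental save happened within the part you received. The sketch below (with the hypothetical name `repeated_object_numbers`) reports such numbers. An empty result proves nothing, since the updates may sit in the missing half, and stream data can cause false hits.

```python
import re
from collections import Counter
from typing import List

# "<object number> <generation> obj" marks the start of an indirect object.
OBJ_START = re.compile(rb"(?<![0-9])(\d+)\s+(\d+)\s+obj\b")

def repeated_object_numbers(data: bytes) -> List[int]:
    """Return object numbers that start more than once in the data."""
    counts = Counter(int(m.group(1)) for m in OBJ_START.finditer(data))
    return sorted(number for number, seen in counts.items() if seen > 1)
```

For any number it reports, the copy that appears last in the file would normally be the one to keep.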

Upvotes: 2

Hussein Khalil

Reputation: 1401

The short answer is no. This will not be possible unless you have the full (uncorrupted) 10 MB file; in that case you will be able to split it by pages, not by megabytes.

Upvotes: 3
