Reputation: 195
If I have only part of the byte array of a PDF file (e.g. the whole file is 10 MB and I have only the first 5 MB), is there any way to save that part as a separate PDF file? Preferably using C#, but any other programming language will be OK.
Upvotes: 2
Views: 1408
Reputation: 7056
PDF files are built out of objects, so they are modular and random access. Arguably the most important part of the whole PDF file comes at the end of the file: it's the XREF table which provides byte offsets to all of those objects.
Not having the last part of the file means that the XREF table isn't present, which is unfortunate to say the least. You might be able to rebuild part of the XREF table (some PDF viewers are capable of doing that), but if you are missing half the file, the chance that this will work is slim to non-existent.
The only possibility left is that the PDF was saved "linearised". Such PDF files have all objects for the first page at the very beginning of the file, together with a smaller XREF table, also near the beginning, that indexes only the objects needed to display the first page. This was done to make a PDF file quicker to display while it's still being downloaded from a web site, for example, but in your case (if the PDF was created that way) it might give you an angle to rebuild at least the first page...
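A quick way to find out whether a partial file is worth examining for this is to look for the linearisation dictionary, which carries a `/Linearized` key and sits in the first object of the file. The following is only a minimal sketch in Python (not C#); the function name `looks_linearized` is made up for illustration, and it only tells you that the marker is present, not that the first page is actually recoverable.

```python
import re

def looks_linearized(data: bytes) -> bool:
    """Heuristic: report whether a /Linearized key appears near the start of the file."""
    # The linearisation dictionary is expected within the first 1024 bytes.
    head = data[:1024]
    has_header = b"%PDF-" in head
    has_marker = re.search(rb"/Linearized\s*[\d.]+", head) is not None
    return has_header and has_marker
```

You would call it on the bytes you already have, for example `looks_linearized(partial_bytes)`, where `partial_bytes` is your truncated byte array.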
PDF Forensics
Let me just add these additional thoughts, which are perhaps a bit extreme (but it all depends on how desperately you want to recover content from such PDF files, of course).
As I said already, PDF files are basically a collection of objects. Each of these objects is properly delimited (the beginning and end are recognisable if you implement a correct PDF parser).
This means that you can start reading at the beginning of the PDF file and build a table of objects. Each object starts with its ID (an object number and a generation number, followed by the keyword "obj"), so you can store the ID and the corresponding file offset for each object you find. You continue until you run out of file. In your case, with only half the file downloaded, you would end up with roughly half of the objects.
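To make the idea concrete, here is a rough sketch in Python (not C#) that scans the raw bytes for `N G obj ... endobj` spans and records their offsets. The name `scan_objects` is hypothetical, and this is a naive byte scan rather than a real PDF parser: it assumes stream data never happens to contain the bytes `endobj`, and it cannot see objects packed inside compressed object streams.

```python
import re

# "<object number> <generation> obj", e.g. b"12 0 obj"
OBJ_HEADER = re.compile(rb"(?<!\d)(\d+)\s+(\d+)\s+obj\b")

def scan_objects(data: bytes) -> dict:
    """Map (object number, generation) -> (start offset, end offset) of complete objects."""
    table = {}
    pos = 0
    while True:
        match = OBJ_HEADER.search(data, pos)
        if match is None:
            break
        end = data.find(b"endobj", match.end())
        if end == -1:
            break  # the file ran out in the middle of this object
        end += len(b"endobj")
        # A later definition of the same ID overwrites an earlier one.
        table[(int(match.group(1)), int(match.group(2)))] = (match.start(), end)
        pos = end  # resume after this object so its body is not rescanned
    return table
```

The last, truncated object is simply dropped, which is what you want here: only objects that are present in full end up in the table.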
The next trick would be to scan over all objects and try to find "Page" objects. These are recognisable because they have to be dictionaries and they have to contain a key called "Type" that has "Page" as its value (not to be confused with "Pages", which marks a node of the page tree). For each such page object you could then check whether all objects that particular page depends on are already there and, if they are, save the page to a new PDF document.
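As a rough illustration of that check, the Python sketch below finds `/Type /Page` dictionaries in the raw bytes and follows their indirect references (`N G R`) to see which of them point at objects that never arrived. All names (`missing_objects_per_page` and the helpers) are made up for this example. It is a heuristic built on regular expressions, not a parser: it assumes uncompressed objects, ignores the `/Parent` link because the page tree is not needed to draw the page, and may over-collect when annotations point at other pages. Actually writing the recovered page to a new file is left out.

```python
import re

OBJ = re.compile(rb"(?<!\d)(\d+)\s+(\d+)\s+obj\b(.*?)endobj", re.DOTALL)
PAGE_TYPE = re.compile(rb"/Type\s*/Page(?![A-Za-z])")  # matches /Page but not /Pages
PARENT = re.compile(rb"/Parent\s+\d+\s+\d+\s+R")
REF = re.compile(rb"(?<!\d)(\d+)\s+(\d+)\s+R\b")

def dictionary_part(body: bytes) -> bytes:
    """Drop stream data so binary content is not mistaken for references."""
    return body.split(b"stream", 1)[0]

def references(body: bytes) -> set:
    """Collect the (number, generation) pairs this object points to."""
    return {(int(n), int(g)) for n, g in REF.findall(dictionary_part(body))}

def missing_objects_per_page(data: bytes) -> dict:
    """Map each page object's ID to the set of objects it needs that are absent."""
    bodies = {(int(n), int(g)): body for n, g, body in OBJ.findall(data)}
    result = {}
    for key, body in bodies.items():
        if not PAGE_TYPE.search(dictionary_part(body)):
            continue
        todo = list(references(PARENT.sub(b"", body)))
        seen, missing = {key}, set()
        while todo:  # walk the dependency graph of this page
            ref = todo.pop()
            if ref in seen:
                continue
            seen.add(ref)
            if ref in bodies:
                todo.extend(references(bodies[ref]))
            else:
                missing.add(ref)
        result[key] = missing
    return result
```

A page whose set comes back empty is a candidate for extraction; a page with a non-empty set is missing at least those objects.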
However...
There be dragons... Keep in mind these subtleties (and I probably forgot a bunch):
Upvotes: 2
Reputation: 1401
The short answer is no. This will not be possible unless you have the full (uncorrupted) 10 MB file, in which case you will be able to split it by pages, not by megabytes.
Upvotes: 3