Reputation: 11
A developer was tasked with pushing PDF files to an FTP site. By accident, each PDF was read as a string, encoded to a UTF-8 byte array, and then pushed to the FTP. Obviously, this caused problems, since PDF files are NOT TEXT.
Below is the code that was executed:
//method passed in a filepath to use for the upload
var filePath = @"C:\temp\myFile.pdf";
byte[] pdfBytes;
using (var sr = new StreamReader(filePath))
{
pdfBytes = Encoding.UTF8.GetBytes(sr.ReadToEnd());
}
//byte array was then uploaded
My question: Is there any way to REVERSE this type of corruption on a per-file basis? Can you take the corrupt PDF, read its bytes, and somehow turn it back into a "PDF string"? (I know PDFs are not strings. I'm just trying to see if it's possible to reverse the corruption.)
NOTE: We've already fixed the code, and are getting the bytes as below. Just wanting to know if there is a way to UNDO what was done.
var pdfBytes = File.ReadAllBytes(filePath);
Upvotes: 1
Views: 1912
Reputation: 74595
I'm going to say "no"... Here's a side-by-side of W3C's dummy.pdf (on the left) and the same file after it went through your mangling process and was written back to disk (on the right):
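You can see that a lot of the bytes on the left have been replaced with EF BF BD, the UTF-8 encoding of U+FFFD, the Unicode replacement (substitution) character. The decoder emits it for every byte sequence that is not valid UTF-8, and every such sequence is re-encoded as the same three bytes, so there is no way to tell which original byte each occurrence stood for. This means that, even though the file size has inflated, large sections of the original file have been lost (near the bottom of the screenshot you can see some plaintext elements that have been preserved). You might be able to recover text embedded in the file, but text that was rasterized to an image, drawings, and other binary objects will likely have been lost.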
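To see why the substitution cannot be undone, here is a minimal sketch. It assumes that `Encoding.UTF8.GetString` followed by `Encoding.UTF8.GetBytes` behaves like the `StreamReader` path in the question for input without a byte order mark. The class name `LossyRoundTripDemo` and the helper `Mangle` are made up for illustration. Two inputs that differ only in their last byte come out identical, so nothing in the output says which one you started with.

```csharp
using System;
using System.Text;

public static class LossyRoundTripDemo
{
    // Hypothetical helper: approximates the faulty upload path
    // (decode the raw bytes as UTF-8 text, then re-encode the text).
    public static byte[] Mangle(byte[] original)
    {
        string decoded = Encoding.UTF8.GetString(original);
        return Encoding.UTF8.GetBytes(decoded);
    }

    public static void Main()
    {
        // "%PDF" followed by one byte that is not valid UTF-8 on its own.
        byte[] a = { 0x25, 0x50, 0x44, 0x46, 0x80 };
        byte[] b = { 0x25, 0x50, 0x44, 0x46, 0xFF };

        // Both lines print 25-50-44-46-EF-BF-BD
        Console.WriteLine(BitConverter.ToString(Mangle(a)));
        Console.WriteLine(BitConverter.ToString(Mangle(b)));
    }
}
```

The ASCII header survives, which matches the preserved plaintext in the comparison, while 0x80 and 0xFF both collapse to EF BF BD.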
Here's the code I used to create the second file:
var filePath = @"C:\temp\dummy.pdf";
byte[] pdfBytes;
using (var sr = new StreamReader(filePath))
{
pdfBytes = Encoding.UTF8.GetBytes(sr.ReadToEnd());
}
File.WriteAllBytes(@"C:\temp\dummy2.pdf", pdfBytes);
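If you want to gauge the damage on a per-file basis, one rough approach is to count how often EF BF BD appears in each uploaded file. The sketch below is only a heuristic: `CorruptionTriage` and `CountReplacementSequences` are hypothetical names, and it assumes the original PDF did not itself contain U+FFFD encoded as UTF-8 and did not start with a byte order mark. Under those assumptions, a count of zero suggests every byte happened to be valid UTF-8 and round-tripped unchanged, while any other count is the number of spots where original data was replaced.

```csharp
using System;
using System.IO;

public static class CorruptionTriage
{
    // Counts non-overlapping occurrences of EF BF BD, the UTF-8
    // encoding of the replacement character U+FFFD.
    public static int CountReplacementSequences(byte[] data)
    {
        int count = 0;
        for (int i = 0; i + 2 < data.Length; i++)
        {
            if (data[i] == 0xEF && data[i + 1] == 0xBF && data[i + 2] == 0xBD)
            {
                count++;
                i += 2; // skip the remaining two bytes of this match
            }
        }
        return count;
    }

    public static void Main(string[] args)
    {
        if (args.Length == 0)
        {
            Console.WriteLine("Usage: pass the path of a suspect file");
            return;
        }

        // Path is supplied by the caller, e.g. C:\temp\dummy2.pdf
        byte[] bytes = File.ReadAllBytes(args[0]);
        Console.WriteLine(CountReplacementSequences(bytes) + " replacement sequence(s) found");
    }
}
```

This only tells you how many places were damaged. It cannot restore them, so files with a non-zero count still need to be uploaded again from the originals.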
Upvotes: 2