user2514644

Reputation: 39

What is wrong with my encoding, when reading characters from PDF?

I'm reading a PDF file in C#, but the characters I get back differ from the ones I see when I open the same file in a PDF viewer, as if the text were stored in a different encoding.

I assumed UTF-8 would be the correct encoding.

What am I doing wrong?

using System.IO;
using System.Text;

string file = @"c:\document.pdf";
// Read the whole file as raw bytes; File.ReadAllBytes opens and closes the file itself.
byte[] buffer = File.ReadAllBytes(file);
// Decode every byte as UTF-8. This is the step that produces the unexpected characters.
string text = Encoding.UTF8.GetString(buffer);

Upvotes: 0

Views: 642

Answers (1)

mcmonkey4eva

Reputation: 1377

PDF is a complex, multi-part binary file format; it is not just UTF-8 text.

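If you want to read the text of a PDF file yourself, you must work through the full PDF file format specification and implement its large and complex details. The visible text is typically held in content streams that are often compressed and use font-specific encodings, so decoding the raw bytes as UTF-8 cannot recover it.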
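As a rough illustration of why decoding the whole file as UTF-8 fails, the sketch below looks at the raw bytes instead. It is not a PDF parser: it only checks for the `%PDF-x.y` header and counts how often the `/FlateDecode` filter name appears, which hints at how many streams are compressed. The class and method names (`PdfPeek`, `ReadHeaderVersion`, `CountToken`) are made up for this example, and it assumes the header sits at the very start of the file.

```csharp
using System;
using System.IO;
using System.Text;

static class PdfPeek
{
    // Returns the version from a leading "%PDF-x.y" header, or null if it is missing.
    // Assumption: the header starts at byte 0.
    public static string ReadHeaderVersion(byte[] data)
    {
        const string marker = "%PDF-";
        int headerLength = marker.Length + 3; // room for "x.y"
        if (data.Length < headerLength) return null;
        string head = Encoding.ASCII.GetString(data, 0, headerLength);
        return head.StartsWith(marker, StringComparison.Ordinal)
            ? head.Substring(marker.Length)
            : null;
    }

    // Counts occurrences of an ASCII token in the raw bytes (naive scan).
    public static int CountToken(byte[] data, string token)
    {
        byte[] needle = Encoding.ASCII.GetBytes(token);
        int count = 0;
        for (int i = 0; i + needle.Length <= data.Length; i++)
        {
            int j = 0;
            while (j < needle.Length && data[i + j] == needle[j]) j++;
            if (j == needle.Length) count++;
        }
        return count;
    }

    static void Main(string[] args)
    {
        if (args.Length == 0)
        {
            Console.WriteLine("usage: PdfPeek <file.pdf>");
            return;
        }
        byte[] data = File.ReadAllBytes(args[0]);
        Console.WriteLine("Header version: " + (ReadHeaderVersion(data) ?? "(none)"));
        Console.WriteLine("/FlateDecode occurrences: " + CountToken(data, "/FlateDecode"));
    }
}
```

If the count is greater than zero, at least part of the content is compressed and will look like random bytes no matter which text encoding is applied to it.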

Upvotes: 4
