Reputation: 2663
I have this file. It is a plain text file. I am trying to find a way to simply read this file into R and write it back again in the same way it was originally encoded. My motivation is to be able to reliably reproduce the file format. However I am having difficulties deciphering how this file was encoded.
The problem lies in line 9, which is supposed to read something like
/V (½ þ ¾ → ‘ ’ ” “ •)
and deep down, I know that these characters really are encoded in this file because an external utility (pdftk) I use can correctly read them. However if I do
readLines('https://github.com/oganm/toSource/raw/master/cant_read.fdf',
encoding = 'UTF-8')
I get a warning
Warning message:
In readLines("https://github.com/oganm/toSource/raw/master/cant_read.fdf", :
line 9 appears to contain an embedded nul
and line 9 appears to be truncated and weirdly encoded.
readLines('https://github.com/oganm/toSource/raw/master/cant_read.fdf',
encoding = 'UTF-8')[9]
[1] "/V (\xfe\xff"
If I use the other option, latin1, I get the wrong characters, along with the same warning
readLines('https://github.com/oganm/toSource/raw/master/cant_read.fdf',
encoding = 'latin1')[9]
[1] "/V (þÿ"
Looking at the relationship between the two versions, þÿ are the latin1 characters for the bytes \xfe\xff, so it makes sense that this is what I see. However, I also know that this is not what I am supposed to see.
Since the output of readLines
is truncated to begin with, it is not possible to re-create the same file anyway, but my ultimate aim is to be able to manipulate this file, so I need a deeper understanding of what is going on.
I have also tried opening the file in various text editors with different encoding options ("UTF-8", "UTF-16", "Western"), but none of them shows the file the way it should be. So the question is: how can I read/write this file, and/or what steps can I take to decode it?
Edit: If I skip the embedded nuls using the skipNul
argument, the truncation issue is resolved, but I am still left with the weird encoding that I can't write back to a file
readLines('https://github.com/oganm/toSource/raw/master/cant_read.fdf',
encoding = 'UTF-8',skipNul=TRUE)[9]
[1] "/V (\xfe\xff\xbd \xfe \xbe !\x92 \030 \031 \035 \034 \")"
readLines('https://github.com/oganm/toSource/raw/master/cant_read.fdf',
encoding = 'latin1',skipNul=TRUE)[9]
"/V (þÿ½ þ ¾ !’ \030 \031 \035 \034 \")"
In latin1
at least some characters are correctly recovered, but I wasn't able to establish a relationship between the rest of the string and the original input.
Note: The þ
that appears is not related to the actual þ
in the file. I added that þ
later to see how it would affect the output. It didn't change anything, which implies the truncation happens at the encoding of ½,
and the data that we can read is probably part of ½.
Upvotes: 2
Views: 652
Reputation: 4138
The encoding of the file is mixed.
Most of the PDF seems to be in latin1, as the first characters should be "%âãÏÓ". (See: PDF File header sequence: Why '25 e2 e3 cf d3' bits stream used in many document?)
However the text within the "/V" command is encoded in UTF-16 big endian. The "fe ff" bytes are the byte order mark (BOM) of that text.
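One way to check this is to encode the expected characters yourself and compare the bytes against the readLines output (a quick sketch; the characters are taken from the question's line 9):

```r
# UTF-16 big-endian uses two bytes per character here; the 0x00 high
# bytes of ½, þ and ¾ are the embedded nuls that readLines warns about.
chars <- c("\u00bd", "\u00fe", "\u00be", "\u2192",
           "\u2018", "\u2019", "\u201d", "\u201c", "\u2022")
sapply(chars, function(ch)
  paste(iconv(ch, from = "UTF-8", to = "UTF-16BE", toRaw = TRUE)[[1]],
        collapse = " "))
```

The resulting pairs (00 bd, 00 fe, 00 be, 21 92, 20 18, 20 19, 20 1d, 20 1c, 20 22) line up with the fragments in the skipNul output once the 00 bytes are dropped: 0x21 prints as "!", 0x20 as a space, and 0x18–0x1d as the octal escapes \030–\035.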
You will probably need to resort to readBin and convert the bytes to the right encoding yourself; PDFs are horrible to parse.
See this post on reading files with mixed encodings using readBin: http://stat545.com/block034_useR-encoding-case-study.html. The iconv function may be useful as well for encoding conversion.
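A minimal sketch of that approach (the bytes of line 9 are reconstructed here for illustration; with the real file you would get them from readBin, and the paren-matching below is naive, ignoring PDF escape sequences such as \)):

```r
# Hypothetical bytes of line 9: "/V (" in latin1, then a UTF-16
# big-endian BOM and UTF-16BE text, then the closing ")".
l9 <- c(charToRaw("/V ("),
        as.raw(c(0xFE, 0xFF,    # BOM: FE FF => big-endian
                 0x00, 0xBD,    # U+00BD  ½
                 0x00, 0x20,    # space
                 0x21, 0x92)),  # U+2192  →
        charToRaw(")"))
# With the real file: bytes <- readBin(path, "raw", n = file.size(path)),
# then split on 0x0A newlines to recover lines without ever
# converting to character, so the embedded nuls survive intact.

# Decode only the span between the parentheses; from = "UTF-16" lets
# the BOM choose the byte order.
open  <- which(l9 == charToRaw("("))[1]
close <- tail(which(l9 == charToRaw(")")), 1)
value <- iconv(list(l9[(open + 1):(close - 1)]),
               from = "UTF-16", to = "UTF-8")
value
```

Writing the file back out is then just writeBin(bytes, out_path), which reproduces it byte for byte, since nothing was ever converted to character.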
Upvotes: 2