Reputation: 2663
I have this file. It is a plain text file. I am trying to find a way to simply read this file into R and write it back again in the same way it was originally encoded. My motivation is to be able to reliably reproduce the file format. However I am having difficulties deciphering how this file was encoded.
The problem lies in line 9, which is supposed to read something like
/V (½ þ ¾ → ‘ ’ ” “ •)
and deep down, I know that these characters really are encoded in this file because an external utility (pdftk) I use can correctly read them. However if I do
readLines('https://github.com/oganm/toSource/raw/master/cant_read.fdf',
encoding = 'UTF-8')
I get a warning
Warning message:
In readLines("https://github.com/oganm/toSource/raw/master/cant_read.fdf", :
line 9 appears to contain an embedded nul
and line 9 appears to be truncated and weirdly encoded.
readLines('https://github.com/oganm/toSource/raw/master/cant_read.fdf',
encoding = 'UTF-8')[9]
[1] "/V (\xfe\xff"
If I use the other option, latin1, I get the wrong characters, along with the same warning
readLines('https://github.com/oganm/toSource/raw/master/cant_read.fdf',
encoding = 'latin1')[9]
[1] "/V (þÿ"
Looking at the relationship between the two versions, þÿ are the latin1 characters for the bytes \xfe\xff, so it makes sense that this is what I see. However, I also know that this is not what I am supposed to see.
Since the output of readLines
is truncated to begin with, it is not possible to re-create the same file anyway, but my ultimate aim is to be able to manipulate this file, so I need a deeper understanding of what is going on.
I have also tried opening the file in various text editors with different encoding options ("UTF-8", "UTF-16", "Western"), but none of them shows the file the way it should be. So the question is: how can I read/write this file, and/or what steps can I take to decode it?
Edit: If I skip the embedded nuls using the skipNul
argument, the truncation issue is resolved, but I am still left with the weird encoding that I can't write back to a file
readLines('https://github.com/oganm/toSource/raw/master/cant_read.fdf',
encoding = 'UTF-8',skipNul=TRUE)[9]
[1] "/V (\xfe\xff\xbd \xfe \xbe !\x92 \030 \031 \035 \034 \")"
readLines('https://github.com/oganm/toSource/raw/master/cant_read.fdf',
encoding = 'latin1',skipNul=TRUE)[9]
"/V (þÿ½ þ ¾ !’ \030 \031 \035 \034 \")"
In latin1
at least some characters are correctly recovered, but I wasn't able to establish a relationship between the rest of the string and the original input.
Note: The þ
that appears is not related to the actual þ
in the file. I added that þ
later to see how it would affect the output. It didn't change anything, which implies the truncation happens at the encoding of ½,
and the data that we can read is probably part of ½.
Upvotes: 2
Views: 652
Reputation: 4138
The encoding of the file is mixed.
Most of the PDF seems to be in latin1, as the first characters should be "%âãÏÓ". (See: PDF File header sequence: Why '25 e2 e3 cf d3' bits stream used in many document?)
However the text within the "/V" command is encoded in UTF-16 big endian. The "fe ff" bytes are the byte order mark (BOM) of that text.
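One way to check this is to encode the expected characters yourself and compare the bytes against the readLines output (a quick sketch; the characters are taken from the question's line 9):

```r
# UTF-16 big-endian uses two bytes per character here; the 0x00 high
# bytes of ½, þ and ¾ are the embedded nuls that readLines warns about.
chars <- c("\u00bd", "\u00fe", "\u00be", "\u2192",
           "\u2018", "\u2019", "\u201d", "\u201c", "\u2022")
sapply(chars, function(ch)
  paste(iconv(ch, from = "UTF-8", to = "UTF-16BE", toRaw = TRUE)[[1]],
        collapse = " "))
```

The resulting pairs (00 bd, 00 fe, 00 be, 21 92, 20 18, 20 19, 20 1d, 20 1c, 20 22) line up with the fragments in the skipNul output once the 00 bytes are dropped: 0x21 prints as "!", 0x20 as a space, and 0x18–0x1d as the octal escapes \030–\035.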
You will probably need to resort to readBin and convert the bytes to the right encoding yourself; PDFs are horrible to parse.
See this post on reading files with mixed encodings using readBin: http://stat545.com/block034_useR-encoding-case-study.html. The iconv function may be useful as well for encoding conversion.
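A minimal sketch of that approach (the bytes of line 9 are reconstructed here for illustration; with the real file you would get them from readBin, and the paren-matching below is naive, ignoring PDF escape sequences such as \)):

```r
# Hypothetical bytes of line 9: "/V (" in latin1, then a UTF-16
# big-endian BOM and UTF-16BE text, then the closing ")".
l9 <- c(charToRaw("/V ("),
        as.raw(c(0xFE, 0xFF,    # BOM: FE FF => big-endian
                 0x00, 0xBD,    # U+00BD  ½
                 0x00, 0x20,    # space
                 0x21, 0x92)),  # U+2192  →
        charToRaw(")"))
# With the real file: bytes <- readBin(path, "raw", n = file.size(path)),
# then split on 0x0A newlines to recover lines without ever
# converting to character, so the embedded nuls survive intact.

# Decode only the span between the parentheses; from = "UTF-16" lets
# the BOM choose the byte order.
open  <- which(l9 == charToRaw("("))[1]
close <- tail(which(l9 == charToRaw(")")), 1)
value <- iconv(list(l9[(open + 1):(close - 1)]),
               from = "UTF-16", to = "UTF-8")
value
```

Writing the file back out is then just writeBin(bytes, out_path), which reproduces it byte for byte, since nothing was ever converted to character.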
Upvotes: 2