Reputation: 1365
The R Data Import/Export manual says that a good way to guess the encoding of a text file is to use the "file" command-line tool (available in Rtools). How would one use this? I already have the newest version of Rtools installed. Is this something I can do from my R session, or do I need to open the command prompt?
Upvotes: 0
Views: 640
Reputation: 14912
The "command prompt" here refers to a "Terminal" window (OS X or Linux) or "Command Prompt" (Windows). From these, you have access to the command-line file utility, which, as the manual states, provides a good description of the type and format of (text) files.
You can also run this straight from R, using the system() function to pass the call to file. For example, on my system, in the current working directory I have three text files:
> list.files(pattern = "\\.txt$")
[1] "00005802.txt" "googlebooks-eng-all-totalcounts-20120701.txt"
[3] "sentences.txt"
> system("file *.txt")
00005802.txt: Par archive data
googlebooks-eng-all-totalcounts-20120701.txt: ASCII text, with very long lines, with no line terminators
sentences.txt: ASCII English text, with very long lines
Note that file may report a file as plain ASCII when it contains only the 128 ASCII characters (bytes 0x00 to 0x7F). Such a file is also valid UTF-8, because UTF-8 encodes those 128 characters with the same single-byte values.
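To make the ASCII/UTF-8 overlap concrete, here is a small sketch in base R. The string x is just an illustrative value; the check relies on charToRaw(), validUTF8() (available since R 3.3.0) and iconv().

```r
# Illustrative ASCII-only string (an assumption for this example).
x <- "plain ascii text"

# Every byte of an ASCII string is below 0x80 (128).
bytes <- charToRaw(x)
all_below_128 <- all(as.integer(bytes) < 128L)

# The same bytes are already valid UTF-8 ...
is_valid_utf8 <- validUTF8(x)

# ... and converting from ASCII to UTF-8 leaves them unchanged.
same_bytes <- identical(charToRaw(iconv(x, "ASCII", "UTF-8")), bytes)

print(c(all_below_128, is_valid_utf8, same_bytes))
```

So an "ASCII" verdict from file does not rule out that the file was written as UTF-8; it only says no bytes above 0x7F were found.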
Also, file is not always right. For instance, 00005802.txt is in fact UTF-8 encoded text that I converted from a PDF using pdftotext.
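Since file can guess wrong, one possible cross-check from within R is to test whether the file's contents are valid UTF-8. This is only a sketch: looks_like_utf8() is a hypothetical helper name, and the two temporary files are made up for the demonstration. It can confirm that a file is consistent with UTF-8, but it cannot tell you which encoding a non-UTF-8 file uses.

```r
# Hypothetical helper: TRUE if every line of the file is valid UTF-8.
looks_like_utf8 <- function(path) {
  lines <- readLines(path, warn = FALSE)
  all(validUTF8(lines))
}

# Demo files written byte by byte so the result does not depend on the locale.
utf8_file   <- tempfile(fileext = ".txt")
latin1_file <- tempfile(fileext = ".txt")
writeBin(as.raw(c(0x63, 0x61, 0x66, 0xc3, 0xa9, 0x0a)), utf8_file)   # "café" in UTF-8
writeBin(as.raw(c(0x63, 0x61, 0x66, 0xe9, 0x0a)), latin1_file)       # "café" in Latin-1

print(looks_like_utf8(utf8_file))
print(looks_like_utf8(latin1_file))
```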
Also beware that on most Windows platforms, you cannot set your system locale to UTF-8 in R. Try Sys.getlocale(). (To set it, use Sys.setlocale().)
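A quick way to inspect the locale from an R session might look like the following. It uses only base R (Sys.getlocale() and l10n_info()); the variable names are arbitrary, and the output depends entirely on your platform.

```r
# Character-type part of the current locale, e.g. the code page on Windows.
loc <- Sys.getlocale("LC_CTYPE")

# l10n_info() reports, among other things, whether the session runs in a UTF-8 locale.
info <- l10n_info()
is_utf8 <- isTRUE(info[["UTF-8"]])

cat("LC_CTYPE:", loc, "\n")
cat("UTF-8 locale:", is_utf8, "\n")
```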
Upvotes: 1
Reputation: 2280
In the context of the R Data Import/Export manual, I interpret it as running file at a command prompt.
However, you can invoke a system command from R with the system() function. For example, if I have a file called mpi.R in the current directory, I can do:
> foo <- system('file mpi.R', intern=TRUE, ignore.stdout=FALSE, ignore.stderr=TRUE, wait=TRUE)
> print(foo)
[1] "mpi.R: ASCII text"
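If you want only the encoding guess rather than the full description, a variation using system2() could look like this. It is a sketch under assumptions: guess_encoding() is a hypothetical helper, and it assumes the file executable on your PATH supports the --brief and --mime-encoding options (the common Unix implementation does; I have not checked the Rtools build).

```r
# Hypothetical helper (not part of base R or Rtools): ask `file` for its encoding guess.
guess_encoding <- function(path) {
  # Sys.which() returns "" when the executable is not on the PATH.
  if (!nzchar(Sys.which("file"))) {
    return(NA_character_)
  }
  # --brief drops the file name; --mime-encoding prints only the encoding guess.
  system2("file",
          args = c("--brief", "--mime-encoding", shQuote(path)),
          stdout = TRUE)
}

# Try it on a small temporary file.
tmp <- tempfile(fileext = ".txt")
writeLines("plain ASCII text", tmp)
enc <- guess_encoding(tmp)
print(enc)
```

Returning NA when file is not found keeps the helper usable on machines where Rtools (or file) is not on the PATH.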
Upvotes: 1