Reputation: 3376
In Linux we can use the file command to get the file type based on the content of the file (not the extension). Is there any similar function in R?
Upvotes: 7
Views: 7506
Reputation: 148
dqmagic is not on CRAN. Below is an R solution that uses Linux's "file" command (actually BSD's 'file' v5.35, dated October 2018, as packaged in Ubuntu 19.04, according to the man page):
file_full_path <- "/home/user/Documents/an_RTF_document.doc"
# shQuote() protects paths that contain spaces or shell metacharacters
file_mime_type <- system2(command = "file",
  args = c("-b", "--mime-type", shQuote(file_full_path)),
  stdout = TRUE) # "text/rtf"
# Gives the list of potentially allowed extensions for this MIME type:
file_possible_ext <- system2(command = "file",
  args = c("-b", "--extension", shQuote(file_full_path)),
  stdout = TRUE) # "???" here; "doc/dot" for MS Word files.
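The two calls above differ only in the flag passed to `file`, so they can be wrapped in a small helper. This is a minimal sketch, assuming `file` is on the PATH; `run_file_cmd` is a hypothetical name, not a function from any package, and it returns `NA` rather than failing when `file` is not installed.

```r
# Hypothetical helper (not from any package): run `file -b <flag> <path>`
# and return its first line of output, or NA if `file` is unavailable.
run_file_cmd <- function(path, flag = "--mime-type") {
  if (!nzchar(Sys.which("file"))) {
    return(NA_character_)  # `file` not found on the PATH
  }
  stopifnot(file.exists(path))
  out <- system2("file", args = c("-b", flag, shQuote(path)),
                 stdout = TRUE, stderr = FALSE)
  if (length(out) == 0L) NA_character_ else out[[1]]
}

# Usage: write a small text file and ask for its MIME type.
tmp <- tempfile(fileext = ".dat")
writeLines("hello world", tmp)
run_file_cmd(tmp)                 # typically "text/plain" where `file` exists
run_file_cmd(tmp, "--extension")  # the extension list for the same file
```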
It may be necessary to check that the actual extension is a valid one for the given MIME type (for instance, readtext::readtext() reads an RTF file but fails if it is saved as *.doc).
file.basename <- basename(file_full_path)
# Strip everything from the last dot onwards
file.base_without_ext <- sub(pattern = "(.*)\\..*$",
                             replacement = "\\1", file.basename)
# Length of the extension: 3 or 4 (doc, docx, odt...)
file.nchar_ext <- nchar(file.basename) -
  nchar(file.base_without_ext) - 1
# The extension itself: doc, rtf...
file_ext <- substring(file.basename,
                      nchar(file.basename) - file.nchar_ext + 1)
# In some (all?) cases, 'file' outputs "???" as the allowed
# extension for an RTF MIME type, so set it by hand:
if (file_mime_type == "text/rtf") {
  file_possible_ext <- "rtf"
}
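# Returns TRUE if the actual extension is known to
# be a valid extension for the given MIME type.
# Exact comparison against the "/"-separated list, so that
# "doc" does not match "docx" and an empty extension matches nothing:
tolower(file_ext) %in%
  strsplit(tolower(file_possible_ext), "/", fixed = TRUE)[[1]]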
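The extension check can also be packaged as a function that compares the extension exactly against each entry of the slash-separated list that `file --extension` prints. This is only a sketch: `ext_is_plausible` is a hypothetical name, `tools::file_ext()` stands in for the manual extension extraction, and the `"???"` handling for RTF follows the observation in this answer rather than any documented behaviour.

```r
# Hypothetical helper: is the file's extension among those that
# `file --extension` reports (e.g. "doc/dot")?
ext_is_plausible <- function(path, possible_ext, mime_type = NA_character_) {
  # Workaround from this answer: `file` may report "???" for RTF files.
  if (identical(mime_type, "text/rtf")) possible_ext <- "rtf"
  allowed <- strsplit(tolower(possible_ext), "/", fixed = TRUE)[[1]]
  tolower(tools::file_ext(path)) %in% allowed
}

ext_is_plausible("report.doc",  "doc/dot")          # TRUE
ext_is_plausible("report.docx", "doc/dot")          # FALSE
ext_is_plausible("letter.doc",  "???", "text/rtf")  # FALSE: RTF saved as .doc
```

Splitting the list and using `%in%` avoids treating the extension as a regular expression, which would let `doc` match `docx`.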
Upvotes: 4
Reputation: 26823
Old question, but maybe relevant for people getting here via Google: you can use dqmagic, a wrapper around libmagic for R, to determine the file type based on the file's content. Since file uses the same library, the results are the same, e.g.:
library(dqmagic)
file_type("DESCRIPTION")
#> [1] "ASCII text"
file_type("src/file.cpp")
#> [1] "C source, ASCII text"
vs.
$ file DESCRIPTION src/file.cpp
DESCRIPTION: ASCII text
src/file.cpp: C source, ASCII text
Disclaimer: I am the author of the package.
Upvotes: 4