user1436187

Reputation: 3376

Determine file type in R based on the content

On Linux, we can use the file command to determine a file's type from its content (not its extension). Is there a similar function in R?

Upvotes: 7

Views: 7506

Answers (2)

mayeulk

Reputation: 148

dqmagic is not on CRAN. Below is an R solution that uses Linux's "file" command (actually BSD's 'file' v5.35, dated October 2018, as packaged in Ubuntu 19.04, according to the man page).

file_full_path <- "/home/user/Documents/an_RTF_document.doc"
# shQuote() protects paths that contain spaces or shell metacharacters
file_mime_type <- system2(command = "file",
  args = c("-b", "--mime-type", shQuote(file_full_path)),
  stdout = TRUE) # "text/rtf"
# Gives the list of potentially allowed extensions for this MIME type:
file_possible_ext <- system2(command = "file",
  args = c("-b", "--extension", shQuote(file_full_path)),
  stdout = TRUE) # "???". "doc/dot" for MS Word files.
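The MIME-type lookup can also be wrapped in a small function that fails early when the "file" command or the target file is missing. This is only a sketch: detect_mime_type() is a hypothetical name, not part of any package, and it assumes a 'file' binary that supports the -b and --mime-type options used above.

```r
# Hypothetical helper (not from any package): wraps the `file` command.
detect_mime_type <- function(path) {
  if (!nzchar(Sys.which("file"))) {
    stop("the 'file' command was not found on the PATH")
  }
  if (!file.exists(path)) {
    stop("no such file: ", path)
  }
  # -b omits the file name from the output; --mime-type prints e.g. "text/rtf".
  # shQuote() keeps paths with spaces or shell metacharacters intact.
  system2("file", args = c("-b", "--mime-type", shQuote(path)), stdout = TRUE)
}
```

Calling detect_mime_type(file_full_path) should then return the same string as the system2() call above.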

It may be necessary to check that the actual extension is a known valid extension for the detected MIME type (for instance, readtext::readtext() reads an RTF file but fails if it is saved as *.doc).

# Actual extension, lower-cased: "doc", "rtf"... ("" if there is none)
file_ext <- tolower(tools::file_ext(file_full_path))
if (file_mime_type == "text/rtf") {
  file_possible_ext <- "rtf"
} # in some (all?) cases, for an rtf mime-type,
  # 'file' outputs "???" as allowed extension

# 'file' separates the allowed extensions with "/" (e.g. "doc/dot")
allowed_ext <- tolower(strsplit(file_possible_ext, "/", fixed = TRUE)[[1]])

# Returns TRUE if the actual extension is known to
# be a valid extension for the given mime type.
# An exact comparison avoids partial matches such as "do" in "doc/dot".
file_ext %in% allowed_ext
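The extension check can also be written as a reusable function. The sketch below assumes that "file --extension" reports the allowed extensions as a slash-separated string, as in the "doc/dot" example above; ext_matches() is a hypothetical name introduced here for illustration.

```r
# Hypothetical helper: TRUE if the extension of `path` is one of the
# slash-separated extensions in `possible_ext` (case-insensitive).
ext_matches <- function(path, possible_ext) {
  actual  <- tolower(tools::file_ext(path))
  allowed <- tolower(strsplit(possible_ext, "/", fixed = TRUE)[[1]])
  # A file without an extension never matches.
  nzchar(actual) && actual %in% allowed
}

ext_matches("report.doc", "doc/dot")  # TRUE
ext_matches("report.do", "doc/dot")   # FALSE: only whole extensions count
ext_matches("notes.RTF", "rtf")       # TRUE
```

Comparing whole extensions rather than pattern-matching avoids false positives when one string happens to be contained in another.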

Upvotes: 4

Ralf Stubner

Reputation: 26823

Old question, but maybe relevant for people getting here via Google: you can use dqmagic, a wrapper around libmagic for R, to determine the file type based on the file's content. Since file uses the same library, the results are the same, e.g.:

library(dqmagic)
file_type("DESCRIPTION")
#> [1] "ASCII text"
file_type("src/file.cpp")
#> [1] "C source, ASCII text"

vs.

$ file DESCRIPTION src/file.cpp 
DESCRIPTION:  ASCII text
src/file.cpp: C source, ASCII text
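For a rough idea of what content-based detection involves, here is a minimal base-R sketch that compares a file's leading bytes against a handful of well-known signatures. It is not how dqmagic or libmagic is implemented and covers only four formats; sniff_type() and its labels are hypothetical and chosen purely for illustration.

```r
# Hypothetical helper: guess a file type from its first bytes.
# Returns NA_character_ when no signature matches.
sniff_type <- function(path, n = 8L) {
  bytes <- readBin(path, what = "raw", n = n)
  signatures <- list(
    "PDF document" = charToRaw("%PDF-"),
    "PNG image"    = as.raw(c(0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a)),
    "RTF document" = charToRaw("{\\rtf"),
    "ZIP archive"  = as.raw(c(0x50, 0x4b, 0x03, 0x04))
  )
  for (type in names(signatures)) {
    sig <- signatures[[type]]
    # Compare only when enough bytes were read to cover the signature.
    if (length(bytes) >= length(sig) &&
        identical(bytes[seq_along(sig)], sig)) {
      return(type)
    }
  }
  NA_character_
}
```

A real magic database holds thousands of rules, including ones that look at offsets deeper in the file, so this approach only makes sense when the set of expected formats is small and known in advance.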

Disclaimer: I am the author of the package.

Upvotes: 4
