Reputation: 109844
I have a bunch of text files whose filenames contain non-ASCII characters. For example, reading one of them by its name fails:
readLines('bbb/ović, Melika_ Omeragić, Ismir_ Bata.txt')
## Error in file(con, "r") : cannot open the connection
## In addition: Warning message:
## In file(con, "r") :
## cannot open file 'bbb/ovi?, Melika_ Omeragi?, Ismir_ Bata.txt': Invalid argument
I checked how R lists the file:
dir('bbb')
## [1] "ovic, Melika_ Omeragic, Ismir_ Bata.txt"
So I tried:
readLines(list.files('bbb', full.names = TRUE))
## Error in file(con, "r") : cannot open the connection
## In addition: Warning message:
## In file(con, "r") :
## cannot open file 'bbb/ovic, Melika_ Omeragic, Ismir_ Bata.txt': No such file or directory
How can I programmatically read these files in? The content of the files does not matter for this question, only the special characters in the file names and reading the files in.
If need be, I'm also open to changing the file names in order to read them in.
I realize I have no MWE but can't create one for this problem. Simply generating a text file and naming it: ović, Melika_ Omeragić, Ismir_ Bata.txt
and using the code I have above to read it in will illustrate the problem.
Upvotes: 2
Views: 974
Reputation: 61
You can rename all the files from non-ASCII to simpler names with a single line of code:
file.rename(Sys.glob("*"),list.files())
Indeed, the function Sys.glob is similar to list.files but handles non-ASCII characters better. Note that the one-liner relies on both functions returning the files in the same order.
If you want to do this renaming recursively across multiple subfolders, I recommend the fs package (functions file_move and dir_ls). For a little more info, see my answer over there: Reading accented filenames in R using list.files.
After that, readLines should work fine, just without the special characters :-)
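As a rough sketch of the same renaming idea with the new names made explicit, the helper below builds an ASCII-only name for each file instead of relying on what list.files reports. The function name rename_to_ascii is hypothetical, and replacing every non-ASCII byte with "_" is my own assumption, not something the one-liner does. It only handles a single folder.

```r
# Hypothetical helper (not part of the one-liner above): give every file in
# `dir` an ASCII-only name and return a data frame describing what was done.
rename_to_ascii <- function(dir) {
  # Sys.glob() is used for the listing, as suggested above
  old <- Sys.glob(file.path(dir, "*"))
  # Assumption: every non-ASCII byte of the base name is replaced with "_"
  new_base <- iconv(enc2utf8(basename(old)), from = "UTF-8", to = "ASCII", sub = "_")
  new <- file.path(dirname(old), new_base)
  # Rename only when the name changes, the target is free, and no two
  # files would end up with the same new name
  todo <- old != new & !file.exists(new) & !duplicated(new)
  if (any(todo)) file.rename(old[todo], new[todo])
  invisible(data.frame(old = old, new = new, renamed = todo))
}
```

Calling rename_to_ascii("bbb") and then list.files("bbb") should show the new names. Files that are skipped keep their original name and are reported with renamed = FALSE.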
Upvotes: 1
Reputation: 1648
This is pretty tricky on Windows, but I was able to find a workaround using these posts:
equivalent of (dir/b > files.txt) in PowerShell
R: can't read unicode text files even when specifying the encoding
The idea is to write the file's name to a temporary file and read it back from there with the appropriate encoding.
My solution is as follows (I use the here library only for reproducibility):
library(here)
obtain.files <- function(folder){
  # Obtain all files in folder and write the listing into a temporary file
  system(paste0("cmd /K ",'cd /d "',folder,'/" & cmd /u /c "dir /b > filestmp.txt"'))
  tmpfilepath <- paste0(folder,"/filestmp.txt")
  # Read the temporary file
  # Not sure it will work in all Windows versions
  con <- file(tmpfilepath, encoding="UCS-2LE")
  RL <- readLines(con)
  close(con)
  # Remove the temporary file
  file.remove(tmpfilepath)
  # Drop the temporary file's own entry from the listing
  RL <- RL[RL!="filestmp.txt"]
  return(RL)
}
folder <- here::here("bbb")
# There is only one file in the folder
files <- obtain.files(folder)
readLines(here::here("bbb",files))
I used the cmd command found in the first post and the output was in UCS-2LE. It might not be platform independent. With powershell the filestmp.txt was in UTF-16, which is probably a more general case.
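To make the decoding step less dependent on the session's locale, the listing can also be read as raw bytes and converted explicitly. This is only a sketch: the function name read_utf16_listing is hypothetical, and it assumes the listing is little-endian UTF-16, with or without a byte order mark.

```r
# Hypothetical helper: read a UTF-16LE directory listing as raw bytes and
# decode it to UTF-8 explicitly, instead of using a re-encoding connection.
read_utf16_listing <- function(path, encoding = "UTF-16LE") {
  bytes <- readBin(path, what = "raw", n = file.size(path))
  # Drop a little-endian byte order mark (ff fe) if one is present
  if (length(bytes) >= 2 && identical(bytes[1:2], as.raw(c(0xff, 0xfe)))) {
    bytes <- bytes[-(1:2)]
  }
  # iconv() accepts a list of raw vectors and returns UTF-8 marked strings
  text <- iconv(list(bytes), from = encoding, to = "UTF-8")
  # Split on Windows or Unix line endings and drop empty lines
  lines <- strsplit(text, "\r?\n")[[1]]
  lines[nzchar(lines)]
}
```

Inside obtain.files, this could stand in for the readLines call, for example RL <- read_utf16_listing(tmpfilepath).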
Upvotes: 0
Reputation: 342
I am able to read a file named ović, Melika_ Omeragić, Ismir_ Bata.txt using readr's read_lines_raw. The byte sequence even matches the text inside, which is a good sign.
#file on my desktop
path <- '~/Desktop/ović, Melika_ Omeragić, Ismir_ Bata.txt'
## Assuming the file contains the word 'foobar'
x <- charToRaw('foobar')
#Using readr
n <- readr::read_lines_raw(path)
print(n)
[[1]]
[1] 66 6f 6f 62 61 72
print(x)
[1] 66 6f 6f 62 61 72
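If you need text rather than bytes, each raw vector can be turned back into a string. A minimal sketch, using a stand-in list so it runs without the file, and assuming the file content is UTF-8:

```r
# Stand-in for the result of readr::read_lines_raw(path): a list of raw vectors
raw_lines <- list(charToRaw("foobar"))
# Convert each raw vector to a string
lines <- vapply(raw_lines, rawToChar, character(1))
# Assumption: the bytes are UTF-8, so mark them as such
Encoding(lines) <- "UTF-8"
print(lines)
```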
Hope this helps.
Upvotes: 0