Reputation: 2017
I'm working with files containing Hindi text and parsing them. I wrote my code in RStudio and it ran without many issues. Now I need to run the same script from the command line using R.exe/Rscript.exe, and it doesn't behave the same way. I ran this simple script from both RStudio and the terminal:
n_p<-'नाम'
Encoding(n_p)
gregexpr(n_p,c('adfdafc','नाम adsfdfa'))
sessionInfo()
Output in RStudio:
> n_p<-'नाम'
>
> Encoding(n_p)
[1] "UTF-8"
>
> gregexpr(n_p,c('adfdafc','नाम adsfdfa'))
[[1]]
[1] -1
attr(,"match.length")
[1] -1
[[2]]
[1] 1
attr(,"match.length")
[1] 3
> sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7600)
Matrix products: default
locale:
[1] LC_COLLATE=English_India.1252 LC_CTYPE=English_India.1252
[3] LC_MONETARY=English_India.1252 LC_NUMERIC=C
[5] LC_TIME=English_India.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] rJava_0.9-10
loaded via a namespace (and not attached):
[1] compiler_3.5.0 tools_3.5.0
Output with R.exe in cmd (used for debugging; Rscript.exe gives similar, if not identical, output):
> n_p<-'à☼"à☼_à☼r'
>
> Encoding(n_p)
[1] "latin1"
>
> gregexpr(n_p,c('adfdafc','à☼"à☼_à☼r adsfdfa'))
[[1]]
[1] -1
attr(,"match.length")
[1] -1
[[2]]
[1] 1
attr(,"match.length")
[1] 9
> sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7600)
Matrix products: default
locale:
[1] LC_COLLATE=English_India.1252 LC_CTYPE=English_India.1252
[3] LC_MONETARY=English_India.1252 LC_NUMERIC=C
[5] LC_TIME=English_India.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] compiler_3.5.0
I've tried changing locales, but Sys.setlocale refuses to work properly. In some cases gregexpr throws an error when it can't parse non-ASCII text. And when it does run without errors, it doesn't match regular expressions correctly. I can't provide a reproducible example at the moment, but I will try to later.
Help.
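One possible reading of the two outputs above, offered as a sketch rather than a confirmed diagnosis: a match.length of 9 instead of 3 is what you would expect if the script file is saved as UTF-8 but R.exe decodes it in the native single-byte code page, so each 3-byte Hindi character is counted as three characters. The byte values below are the standard UTF-8 encoding of नाम; the variable names are only for illustration.

```r
# UTF-8 bytes of the three characters in the pattern (3 bytes each)
utf8_bytes <- as.raw(c(0xe0, 0xa4, 0xa8,
                       0xe0, 0xa4, 0xbe,
                       0xe0, 0xa4, 0xae))
s <- rawToChar(utf8_bytes)

# Marked as latin1 (as in the R.exe output), the string is 9 bytes long,
# which is what a one-byte-per-character encoding counts as its length
Encoding(s) <- "latin1"
n_bytes <- nchar(s, type = "bytes")

# Marked as UTF-8, the same bytes are counted as characters instead
Encoding(s) <- "UTF-8"
n_chars <- nchar(s, type = "chars")

print(c(bytes = n_bytes, chars = n_chars))
```

If this is what is happening, the bytes in the string are fine and only the encoding R assumes for the script is wrong.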
Upvotes: 5
Views: 4079
Reputation: 121
The right answer is to run Rscript with the option --encoding=<file encoding>.
There is no need to set the locale, and as you probably found out, it doesn't work anyway. If your file is UTF-8:
Rscript.exe --encoding=UTF-8 file.R
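If changing how the script is launched is not an option, one alternative that is not part of this answer is to keep the script file pure ASCII by writing the pattern with \u escapes, which R should mark as UTF-8 independently of how the file itself is decoded. A minimal sketch, reusing n_p and the test strings from the question:

```r
# "\u0928\u093e\u092e" is the question's pattern written with Unicode
# escapes, so the script file itself contains only ASCII characters.
n_p <- "\u0928\u093e\u092e"
x   <- c("adfdafc", "\u0928\u093e\u092e adsfdfa")

# Check how R has marked the string
print(Encoding(n_p))

# Same call as in the question; positions and lengths are in characters
m <- gregexpr(n_p, x)
print(m[[2]])
```

The trade-off is readability: the escapes are robust but harder to maintain than the literal Hindi text.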
Upvotes: 7
Reputation: 18980
You need to ensure that R is running in a suitable locale:
In Rterm, use Sys.getlocale() to find your current locale.
You can set your locale using:
Sys.setlocale(category = "LC_ALL", locale = "hi-IN")
# Try "hi-IN.UTF-8" too...
Locale names are listed on MSDN, among other places.
Once you have the correct value, put the Sys.setlocale() command in your ~/.Rprofile.
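Because Sys.setlocale() reports failure through a warning and an empty-string return value, it can help to check the result before relying on it. A minimal sketch, assuming the locale names suggested above (whether either is accepted depends on the platform and R version); the helper name try_locale is hypothetical:

```r
# Hypothetical helper: try a locale name and report whether it was accepted.
try_locale <- function(name, category = "LC_CTYPE") {
  old <- Sys.getlocale(category)
  on.exit(Sys.setlocale(category, old))  # restore the previous setting
  res <- suppressWarnings(Sys.setlocale(category, name))
  # Sys.setlocale() returns "" when the request cannot be honoured
  !identical(res, "")
}

for (name in c("hi-IN", "hi-IN.UTF-8")) {
  cat(name, if (try_locale(name)) "accepted" else "rejected", "\n")
}
```

The helper only probes and then restores the previous locale; to actually switch, call Sys.setlocale() directly with a name that was accepted.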
Upvotes: 0