Reputation: 521
I'm doing text mining in R with Spanish documents and I keep running into encoding issues, including with the various workarounds I have come up with. I have searched through several related topics but can't find a clear solution, and the fact that things work differently every time probably means that I don't really understand the problem.
I extracted text data from a PDF using pdf_text (package pdftools), and the accented characters are shown as Unicode escapes, e.g. "<U+00ED>". However, when I try to substitute these with the normal characters using gsub (or to find them with grepl), R doesn't find anything. The output looks something like this:
> txt
[1] "Comunidad Aut<U+00F3>noma: Pa<U+00ED>s Vasco"
[2] "Provincia: <U+00C1>lava"
[3] "Alda se extingue y su territorio se incorpora a Valle de Arana. Censo 1950"
> str(txt)
chr [1:3] "Comunidad Aut<U+00F3>noma: Pa<U+00ED>s Vasco" ...
> grepl("<U\\+00F3>", txt)
[1] FALSE FALSE FALSE
> grepl("<U+00F3>", txt)
[1] FALSE FALSE FALSE
> gsub("<U\\+00F3>", "o", txt)
[1] "Comunidad Aut<U+00F3>noma: Pa<U+00ED>s Vasco"
[2] "Provincia: <U+00C1>lava"
[3] "Alda se extingue y su territorio se incorpora a Valle de Arana. Censo 1950"
However, if you type these strings in manually, R does find them and substitutions are possible:
> txt = c("Comunidad Aut<U+00F3>noma: Pa<U+00ED>s Vasco", "Provincia: <U+00C1>lava")
> str(txt)
chr [1:2] "Comunidad Aut<U+00F3>noma: Pa<U+00ED>s Vasco" ...
> grepl("<U\\+00F3>", txt)
[1] TRUE FALSE
> gsub("<U\\+00F3>", "o", txt)
[1] "Comunidad Autonoma: Pa<U+00ED>s Vasco"
[2] "Provincia: <U+00C1>lava"
Why is this happening? What is R actually reading that makes it treat the two as different?
In the past I used to write an adapt function that basically substitutes these sequences with the normal characters, but I'm finding that in some cases it doesn't work, and I don't really understand why. Also, when the data comes from OCR, the mess is even bigger, and the translation to UTF-8 differs constantly, etc.
Does anyone know of any general approach that solves this stuff? I'll be working extensively with this in the future.
Thanks a lot.
P.S.:
> sessionInfo()
R version 3.4.1 (2017-06-30)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
locale:
[1] C
attached base packages:
[1] stats graphics grDevices utils datasets methods
[7] base
other attached packages:
[1] pdftools_1.4 stringr_1.2.0
loaded via a namespace (and not attached):
[1] compiler_3.4.1 magrittr_1.5 tools_3.4.1 Rcpp_0.12.12
[5] stringi_1.1.5
Upvotes: 2
Views: 2002
Reputation: 1482
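The problem here is that your locale is set to C, so R will refuse to print non-ASCII characters. The strings returned by pdf_text contain the real accented characters; "<U+00F3>" is only how print displays them in that locale, which is why a pattern looking for the literal text "<U+00F3>" finds nothing. If you change your locale to one that allows printing Unicode, then you will see the characters that you expect.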
txt <- "Comunidad Aut\u00F3noma: Pa\u00EDs Vasco"
Sys.setlocale("LC_CTYPE", "C") # switch character type locale to "C"
## "C"
print(txt)
## [1] "Comunidad Aut<U+00F3>noma: Pa<U+00ED>s Vasco"
Sys.setlocale("LC_CTYPE", "") # switch to native locale
## [1] "en_US.UTF-8"
print(txt)
## [1] "Comunidad Autónoma: País Vasco"
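To see why the extracted text and the hand-typed text behave differently, it can help to compare the two strings directly. The following is a minimal sketch in base R; the variable names real and typed are only illustrative. The first string holds the actual character U+00F3, the second holds the eight ASCII characters that print shows in the C locale.

```r
# 'real' contains the single character U+00F3 (o with acute accent).
real  <- "Comunidad Aut\u00F3noma"
# 'typed' contains the literal ASCII text "<U+00F3>", as when pasted by hand.
typed <- "Comunidad Aut<U+00F3>noma"

nchar(real)                  # 18 characters
nchar(typed)                 # 25 characters: "<U+00F3>" alone takes 8

# The literal pattern only matches the hand-typed string.
grepl("<U\\+00F3>", real)    # FALSE
grepl("<U\\+00F3>", typed)   # TRUE

# The real character is matched by writing it as an escape in the pattern.
grepl("\u00F3", real)        # TRUE

# Code point of the 14th character: 243, i.e. hexadecimal F3.
utf8ToInt(real)[14]
```

Both strings print identically in the C locale, so nchar or utf8ToInt is a quick way to check which kind you are dealing with.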
Here's how to replace the "o with acute accent" character:
gsub("\u00F3", "o", txt)
## [1] "Comunidad Autonoma: País Vasco"
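The same idea can be extended to several characters at once. The sketch below is one possible base R approach, not the only one: strip_accents, from and to are hypothetical names introduced here, and the table covers only the acute-accented vowels, so other characters (for example ñ or ü) would need to be added if they should be replaced too.

```r
# Hypothetical lookup: accented vowels and their ASCII replacements,
# written as \u escapes so the source file itself stays ASCII.
from <- c("\u00E1", "\u00E9", "\u00ED", "\u00F3", "\u00FA",
          "\u00C1", "\u00C9", "\u00CD", "\u00D3", "\u00DA")
to   <- c("a", "e", "i", "o", "u",
          "A", "E", "I", "O", "U")

# Replace each accented character in turn; fixed = TRUE treats the
# pattern as plain text rather than as a regular expression.
strip_accents <- function(x) {
  for (i in seq_along(from)) {
    x <- gsub(from[i], to[i], x, fixed = TRUE)
  }
  x
}

txt <- c("Comunidad Aut\u00F3noma: Pa\u00EDs Vasco", "Provincia: \u00C1lava")
strip_accents(txt)
## [1] "Comunidad Autonoma: Pais Vasco" "Provincia: Alava"
```

If adding a dependency is acceptable, stringi::stri_trans_general(txt, "Latin-ASCII") performs a similar transliteration without a hand-written table.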
Upvotes: 3