Text encoding issues in R

Question

I'm doing text mining in R on Spanish documents and I keep running into encoding issues, and the workarounds I come up with behave inconsistently. I have searched through various topics but can't find a clear solution. The fact that things work differently every time probably means that I don't really understand the problem.

I extracted text data from a PDF using pdf_text (package pdftools), and the accented characters are shown as Unicode escapes, e.g. "<U+00F3>". However, when I try to replace these with the plain characters using gsub (or find them with grepl), R doesn't find anything. The output looks something like this:

> txt
[1] "Comunidad Aut<U+00F3>noma: Pa<U+00ED>s Vasco"
[2] "Provincia: <U+00C1>lava"
[3] "Alda se extingue y su territorio se incorpora a Valle de Arana. Censo 1950"
> str(txt)
 chr [1:3] "Comunidad Aut<U+00F3>noma: Pa<U+00ED>s Vasco" ...
> grepl("<U+00F3>", txt)
[1] FALSE FALSE FALSE
> grepl("", txt)
[1] FALSE FALSE FALSE
> gsub("<U+00F3>", "o", txt)
[1] "Comunidad Aut<U+00F3>noma: Pa<U+00ED>s Vasco"
[2] "Provincia: <U+00C1>lava"
[3] "Alda se extingue y su territorio se incorpora a Valle de Arana. Censo 1950"

However, if you type these strings in manually, R does find them and substitutions are possible:

> txt = c("Comunidad Aut<U+00F3>noma: Pa<U+00ED>s Vasco", "Provincia: <U+00C1>lava")
> str(txt)
 chr [1:2] "Comunidad Aut<U+00F3>noma: Pa<U+00ED>s Vasco" ...
> grepl("<U+00F3>", txt)
[1]  TRUE FALSE
> gsub("<U+00F3>", "o", txt)
[1] "Comunidad Autonoma: Pa<U+00ED>s Vasco"
[2] "Provincia: <U+00C1>lava"

Why is this happening? What is R actually reading that makes it treat the two as different?
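To make the comparison concrete, here is a minimal sketch that may reproduce the difference. It rests on an assumption not confirmed above: that pdf_text returned real accented characters which the C locale only displays as "<U+00F3>", while the hand-typed vector contains those eight ASCII characters literally. The names from_pdf and typed are made up for the illustration.

```r
# Assumption: the PDF text holds the real character U+00F3, written here
# with a \u escape so the example does not depend on the console locale.
from_pdf <- "Comunidad Aut\u00f3noma: Pa\u00eds Vasco"

# The hand-typed version: "<", "U", "+", "0", "0", "F", "3", ">" as plain ASCII.
typed <- "Comunidad Aut<U+00F3>noma: Pa<U+00ED>s Vasco"

# The two strings differ in length, although they may print the same
# way in a C locale.
nchar(from_pdf, type = "chars")
nchar(typed, type = "chars")

# Searching for the display form only matches the hand-typed string.
in_pdf_display   <- grepl("<U+00F3>", from_pdf, fixed = TRUE)
in_typed_display <- grepl("<U+00F3>", typed, fixed = TRUE)

# Searching for the escaped code point matches the real character.
in_pdf_escape <- grepl("\u00f3", from_pdf, fixed = TRUE)
replaced      <- gsub("\u00f3", "o", from_pdf, fixed = TRUE)
```

If that assumption holds, it would explain why the same-looking pattern succeeds on one vector and fails on the other.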

In the past I used to build an adapt function that basically substitutes these sequences with the normal characters, but I'm finding that in some cases it doesn't work, and I don't really understand why. Also, when the data comes from OCR, the mess is even bigger, and the translation to UTF-8 differs constantly, etc.
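For reference, a substitution helper of that kind could look roughly like the sketch below. It is not my actual adapt function: the name strip_accents and the character table are hypothetical, and it assumes the input holds real accented characters rather than the literal text "<U+00F3>".

```r
# Hypothetical lookup table: accented Spanish characters, written as
# \u escapes so the source stays ASCII, and their plain replacements.
accented <- c("\u00e1", "\u00e9", "\u00ed", "\u00f3", "\u00fa",
              "\u00c1", "\u00c9", "\u00cd", "\u00d3", "\u00da",
              "\u00f1", "\u00d1")
plain    <- c("a", "e", "i", "o", "u",
              "A", "E", "I", "O", "U",
              "n", "N")

# Replace each accented character in turn; fixed = TRUE treats the
# pattern as a literal string rather than a regular expression.
strip_accents <- function(x) {
  for (i in seq_along(accented)) {
    x <- gsub(accented[i], plain[i], x, fixed = TRUE)
  }
  x
}

cleaned <- strip_accents(c("Comunidad Aut\u00f3noma: Pa\u00eds Vasco",
                           "Provincia: \u00c1lava"))
```

Note that mapping "ñ" to "n" changes Spanish words, so that pair may need to be dropped depending on the analysis.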

Does anyone know of any general approach that solves this stuff? I'll be working extensively with this in the future.

Thanks a lot.

P.S.:

> sessionInfo()
R version 3.4.1 (2017-06-30)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib

locale:
[1] C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods  
[7] base     

other attached packages:
[1] pdftools_1.4  stringr_1.2.0

loaded via a namespace (and not attached):
[1] compiler_3.4.1 magrittr_1.5   tools_3.4.1    Rcpp_0.12.12  
[5] stringi_1.1.5
