Fran Villamil

Reputation: 521

Text encoding issues in R

I'm doing text mining in R with Spanish documents and I keep running into encoding issues, along with inconsistent results from the workarounds I've tried. I have searched through several related topics but can't find a clear solution. The fact that things work differently every time probably means that I don't really understand the problem.

I extracted text data from a PDF using pdf_text (package pdftools), and the accented characters show up as Unicode escapes, e.g. "<U+00ED>". However, when I try to substitute these with the plain characters using gsub (or find them with grepl), R doesn't find anything. The output looks something like this:

> txt
[1] "Comunidad Aut<U+00F3>noma: Pa<U+00ED>s Vasco"                              
[2] "Provincia: <U+00C1>lava"                                                   
[3] "Alda se extingue y su territorio se incorpora a Valle de Arana. Censo 1950"
> str(txt)
 chr [1:3] "Comunidad Aut<U+00F3>noma: Pa<U+00ED>s Vasco" ...
> grepl("<U\\+00F3>", txt)
[1] FALSE FALSE FALSE
> grepl("<U+00F3>", txt)
[1] FALSE FALSE FALSE
> gsub("<U\\+00F3>", "o", txt)
[1] "Comunidad Aut<U+00F3>noma: Pa<U+00ED>s Vasco"                              
[2] "Provincia: <U+00C1>lava"                                                   
[3] "Alda se extingue y su territorio se incorpora a Valle de Arana. Censo 1950"

However, if you enter these strings manually, R does find them and substitutions are possible:

> txt = c("Comunidad Aut<U+00F3>noma: Pa<U+00ED>s Vasco", "Provincia: <U+00C1>lava")
> str(txt)
 chr [1:2] "Comunidad Aut<U+00F3>noma: Pa<U+00ED>s Vasco" ...
> grepl("<U\\+00F3>", txt)
[1]  TRUE FALSE
> gsub("<U\\+00F3>", "o", txt)
[1] "Comunidad Autonoma: Pa<U+00ED>s Vasco"
[2] "Provincia: <U+00C1>lava" 

Why is this happening? What is R actually reading that makes it treat the two as different?

In the past I used to build an adapt function that basically substitutes these sequences with the plain characters, but I'm finding that in some cases it doesn't work, and I don't really understand why. Also, when the data come from OCR, the mess is even bigger, and the conversion to UTF-8 is inconsistent.

Does anyone know of a general approach that solves this? I'll be working extensively with this kind of text in the future.

Thanks a lot.

P.S.:

> sessionInfo()
R version 3.4.1 (2017-06-30)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib

locale:
[1] C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods  
[7] base     

other attached packages:
[1] pdftools_1.4  stringr_1.2.0

loaded via a namespace (and not attached):
[1] compiler_3.4.1 magrittr_1.5   tools_3.4.1    Rcpp_0.12.12  
[5] stringi_1.1.5 

Upvotes: 2

Views: 2002

Answers (1)

Patrick Perry

Reputation: 1482

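The problem here is that your locale is set to C, so R will refuse to print non-ASCII characters. The strings themselves already contain the accented characters; <U+00F3> is only how print() displays them in that locale, which is why a pattern matching the literal text <U+00F3> finds nothing. If you change your locale to one that allows printing Unicode, then you will see the characters that you expect.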

txt <- "Comunidad Aut\u00F3noma: Pa\u00EDs Vasco"

Sys.setlocale("LC_CTYPE", "C") # switch character type locale to "C"
## [1] "C"

print(txt)
## [1] "Comunidad Aut<U+00F3>noma: Pa<U+00ED>s Vasco"

Sys.setlocale("LC_CTYPE", "") # switch to native locale
## [1] "en_US.UTF-8"

print(txt)
## [1] "Comunidad Autónoma: País Vasco"
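To check what a string really contains, independently of how it prints, something along these lines may help. This is only a sketch: the variable names (typed, has_real_char and so on) are made up for illustration, and typed stands in for a string entered by hand as the literal ASCII text.

```r
# Strings holding the real accented characters (written as \u escapes so the
# example does not depend on the encoding of the script file).
txt <- c("Comunidad Aut\u00F3noma: Pa\u00EDs Vasco", "Provincia: \u00C1lava")

# A string typed by hand as plain ASCII: '<', 'U', '+', '0', '0', 'F', '3', '>'
typed <- "Comunidad Aut<U+00F3>noma"

# The real character is one character (two bytes in UTF-8), code point 0xF3.
nchar("\u00F3", type = "chars")   # 1
nchar("\u00F3", type = "bytes")   # 2
utf8ToInt("\u00F3")               # 243

# Searching for the real character works on the extracted-style strings...
has_real_char <- grepl("\u00F3", txt, fixed = TRUE)

# ...while searching for the literal placeholder text does not,
has_placeholder <- grepl("<U+00F3>", txt, fixed = TRUE)

# except in a string where that text was actually typed in.
typed_has_placeholder <- grepl("<U+00F3>", typed, fixed = TRUE)

# Encoding() shows how R has marked each string.
Encoding(txt)     # "UTF-8" "UTF-8"
Encoding(typed)   # "unknown" (plain ASCII)
```

Here has_real_char is TRUE for the first element only, has_placeholder is FALSE throughout, and typed_has_placeholder is TRUE, which matches the behaviour described in the question.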

Here's how to replace the "o with acute accent" character:

gsub("\u00F3", "o", txt)
## [1] "Comunidad Autonoma: País Vasco"
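If the goal is to strip accents from every character rather than one at a time, one possible general approach (not the only one) is the Latin-ASCII transliterator in the stringi package. The sketch below assumes stringi is installed; it already appears among the loaded namespaces in the question's sessionInfo(). The name plain is just an illustrative variable.

```r
library(stringi)

txt <- c("Comunidad Aut\u00F3noma: Pa\u00EDs Vasco", "Provincia: \u00C1lava")

# Transliterate Latin letters with diacritics to their closest ASCII letters.
# Note that this also turns characters such as n-with-tilde into plain n.
plain <- stri_trans_general(txt, "Latin-ASCII")

print(plain)
## [1] "Comunidad Autonoma: Pais Vasco" "Provincia: Alava"
```

Because the result is pure ASCII, it prints the same way in the C locale as in a UTF-8 locale. Whether dropping the accents is acceptable depends on the analysis; for Spanish text it can merge distinct words.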

Upvotes: 3
