Mapping unicode characters to language in R

Question

I'm extracting data from a .pdf file which is in Tamil(an Indian local language) Language, After extracting the text in R from pdf file gives me some junk or unicode character format text. I'm unable to map it to proper text or the same text as it is in pdf file, Here is the code

library(tm)
library(pdftools)
library(qdapRegex)
library(stringr)
library(textreadr)

if(!require("ghit")){
  install.packages("ghit")
}
# on 64-bit Windows
ghit::install_github(c("ropenscilabs/tabulizerjars", "ropenscilabs/tabulizer"), INSTALL_opts = "--no-multiarch")
# elsewhere
ghit::install_github(c("ropenscilabs/tabulizerjars", "ropenscilabs/tabulizer"))
text <- extract_tables("D:/first.pdf")
 text[[1]][,2][3]

This gives me some junk character like

"Â«Ã®Ã¹Â£Ã±Â¢Â«Ã°Ã¬Â¢Â¬Ã¬  , Ã¢Ã´Â¢Ã¬Â£Ã±Â¢ÃºÂ¢ Â«Ã³Â£ Ì"

I tried with changing the unicode type

library(stringi)
stri_trans_toupper("ÃªÂ¶Ã³Â®", locale = "Tamil")

But no success though. Any suggestion will be appreciable.

Thanks.

Mapping unicode characters to language in R

Answers (1)

Related Questions