reading Tamil corpus in R

Question

I have built a basic word prediction product in R as part of an online course project. I want to extend it to predict the next word from Tamil phrases. I used a sample of the Tamil language corpora from the HC Corpora website, read it into R, and created a tm corpus.

library(tm)     # VCorpus, VectorSource, DocumentTermMatrix
library(RWeka)  # NGramTokenizer, Weka_control

testData <- "திருவண்ணாமலை, கொல்லிமலை, சதுரகிரி என அவன் சித்தர்களை பல 
        இடங்களில், மலைகளில், குகைகளில், இன்னும் பல ரகசிய இடங்களில்
        அவன் சித்தர்களை சந்தித்து பல நம்பமுடியாத சக்திகளைப்
        பெற்றுவிட்டான் என்று சொல்லிக் கொள்கிறார்கள்"

# Tokenize into unigrams (min = max = 1)
getUnigrams <- function(x) {
  NGramTokenizer(x, Weka_control(min = 1, max = 1))
}

# Build the document-term matrix, then sum each term's column to get its count
unigrams <- DocumentTermMatrix(VCorpus(VectorSource(testData)),
                               control = list(tokenize = getUnigrams))
unigramsList <- data.frame(slam::col_sums(unigrams))
head(unigramsList, 3)

>         slam..col_sums.unigrams.
அவன்                            2
இடங்களில்                        2
இன்னும்                          1

The actual Tamil words are the row names of this data frame and display properly on screen. However, when I add them as a column next to their respective counts, the resulting data frame does not display the Tamil words correctly in the column unigramsList$word1. Instead, it shows the Unicode escape codes of the underlying Tamil characters.

unigramsList$word1 <- rownames(unigramsList) ## Encoding issues arise from here!!!
head(unigramsList, 3)

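          slam..col_sums.unigrams.
அவன்                            2
இடங்களில்                        2
இன்னும்                          1
                                                                           word1
அவன்                                             <U+0B85><U+0BB5><U+0BA9><U+0BCD>
இடங்களில் <U+0B87><U+0B9F><U+0B99><U+0BCD><U+0B95><U+0BB3><U+0BBF><U+0BB2><U+0BCD>
இன்னும்                   <U+0B87><U+0BA9><U+0BCD><U+0BA9><U+0BC1><U+0BAE><U+0BCD>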
>
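
One way to tell whether the words themselves were altered, or only their printed form, is to inspect a string directly instead of printing the data frame. The following is a minimal sketch under stated assumptions, not a fix: `word` is a hypothetical stand-in for one row name, written with `\u` escapes so the script does not depend on the source file's encoding, and `df` is an illustrative data frame, not the real `unigramsList`.

```r
# Hypothetical stand-in for the row name "அவன்", written as Unicode escapes
word <- "\u0b85\u0bb5\u0ba9\u0bcd"

enc         <- Encoding(word)               # declared encoding mark of the string
n_chars     <- nchar(word, type = "chars")  # number of code points
code_points <- utf8ToInt(word)              # integer code points of the string

# Copy the string into a character column (illustrative data frame)
df <- data.frame(count = 2L, word1 = word, stringsAsFactors = FALSE)

# TRUE means the stored column value is the same string as the original,
# whatever the console shows when the data frame is printed
same_string <- identical(df$word1, word)
```

If `same_string` is `TRUE` and `code_points` holds Tamil code points (the Tamil block starts at U+0B80), the column content is intact and the escapes would be a display issue only.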

I tried to continue with these Unicode escapes, built 2-, 3-, and 4-gram tables, and used them for my prediction. But all subsequent operations on this column are displayed as raw Unicode escapes only. I want to be able to view and predict the words in their native Tamil characters.
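
For viewing the words in Tamil script outside the console, one possible approach is to write them to a UTF-8 file and open it in a Unicode-aware editor. This is a sketch only: `words`, `counts`, and `out_file` are hypothetical names, and the two words are written as `\u` escapes to keep the script encoding-independent.

```r
# Hypothetical sample: two words from the question with their counts
words  <- c("\u0b85\u0bb5\u0ba9\u0bcd",
            "\u0b87\u0ba9\u0bcd\u0ba9\u0bc1\u0bae\u0bcd")
counts <- c(2L, 1L)

out_file <- tempfile(fileext = ".txt")  # hypothetical output location

# One tab-separated line per word, explicitly kept as UTF-8
lines <- enc2utf8(paste(words, counts, sep = "\t"))

# useBytes = TRUE writes the UTF-8 bytes as they are, without
# re-encoding them to the native code page of the session
writeLines(lines, out_file, useBytes = TRUE)

# Read the file back, declaring its content as UTF-8
round_trip <- readLines(out_file, encoding = "UTF-8")
unchanged  <- identical(round_trip, lines)
```

If `unchanged` is `TRUE`, the file holds the original Tamil text, so the n-gram tables could be exported and checked the same way.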

My session information is as below:

> sessionInfo()
R version 3.2.5 (2016-04-14)
Platform: i386-w64-mingw32/i386 (32-bit)
Running under: Windows 7 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] RWeka_0.4-29  tm_0.6-2      NLP_0.1-9     stringi_1.0-1 stringr_1.0.0

loaded via a namespace (and not attached):
[1] magrittr_1.5      parallel_3.2.5    tools_3.2.5       slam_0.1-37      
[5] grid_3.2.5        rJava_0.9-8       RWekajars_3.9.0-1
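
The locale above uses code page 1252, a Western European code page that has no Tamil letters. A short check like the one below reports whether a session runs in a UTF-8 locale. It is a sketch for diagnosis only, and the variable names are illustrative.

```r
# Query the current session's localization settings
info <- l10n_info()

# TRUE only when the session's native encoding is UTF-8
is_utf8_locale <- isTRUE(info[["UTF-8"]])

# Character-type locale, e.g. "English_United States.1252" in the session above
ctype <- Sys.getlocale("LC_CTYPE")

message("LC_CTYPE: ", ctype)
message("UTF-8 locale: ", is_utf8_locale)
```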
