Periasamy Ramamoorthy

Reputation: 31

Reading a Tamil corpus in R

I have built a basic word-prediction product in R as part of an online course project. I want to extend it to predict the next word from Tamil phrases. I used a sample of the Tamil-language corpora from the HC Corpora website, read it into R, and created a tm corpus.

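library(tm)     # VCorpus, VectorSource, DocumentTermMatrix
library(RWeka)  # NGramTokenizer, Weka_control
testData <- "திருவண்ணாமலை, கொல்லிமலை, சதுரகிரி என அவன் சித்தர்களை பல 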
        இடங்களில், மலைகளில், குகைகளில், இன்னும் பல ரகசிய இடங்களில்
        அவன் சித்தர்களை சந்தித்து பல நம்பமுடியாத சக்திகளைப்
        பெற்றுவிட்டான் என்று சொல்லிக் கொள்கிறார்கள்"
getUnigrams <- function(x) {NGramTokenizer(x, 
                            Weka_control(min=1, max=1))}
unigrams <- DocumentTermMatrix(VCorpus(VectorSource(testData)),
                               control=list(tokenize=getUnigrams))
unigramsList <- data.frame(slam::col_sums(unigrams))
head(unigramsList, 3)

>         slam..col_sums.unigrams.
அவன்                            2
இடங்களில்                        2
இன்னும்                          1

The actual Tamil words are the row names of this data frame and are displayed properly on screen. However, when I add them as a column alongside their respective counts, the resulting data frame does not display the Tamil words correctly in the column unigramsList$word1. Instead, it shows the Unicode code points of the underlying Tamil word.

unigramsList$word1 <- rownames(unigramsList) ## Encoding issues arise from here!!!
head(unigramsList, 3)

slam..col_sums.unigrams.
அவன்                            2
இடங்களில்                        2
இன்னும்                          1
                                                                           word1
அவன்                                             <U+0B85><U+0BB5><U+0BA9><U+0BCD>
இடங்களில் <U+0B87><U+0B9F><U+0B99><U+0BCD><U+0B95><U+0BB3><U+0BBF><U+0BB2><U+0BCD>
இன்னும்                   <U+0B87><U+0BA9><U+0BCD><U+0BA9><U+0BC1><U+0BAE><U+0BCD>
> 

I tried to continue with these Unicode escapes and built 2-, 3- and 4-grams for my prediction, but every subsequent operation on this column displays raw Unicode escapes only. I want to be able to view and predict the words in their native Tamil characters.
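The <U+0B85>-style output above is most likely a display effect: in a session whose locale is Windows-1252 (as in the session information below), R's data frame printing escapes characters it cannot represent in the native encoding. The following is a minimal sketch, not a fix, for checking that the stored strings are still intact. The small data frame is a hypothetical stand-in for unigramsList, and the Tamil words are written with \u escapes so the script is ASCII-only.

```r
# Hypothetical stand-in for the question's unigramsList (two words only).
words <- c("\u0B85\u0BB5\u0BA9\u0BCD",
           "\u0B87\u0BA9\u0BCD\u0BA9\u0BC1\u0BAE\u0BCD")
unigramsList <- data.frame(count = c(2, 1), row.names = words)
unigramsList$word1 <- rownames(unigramsList)

word <- unigramsList$word1[1]

# Declared encoding mark of the string ("UTF-8", "latin1" or "unknown").
Encoding(word)

# Code points actually stored; these do not depend on how the console prints.
code_points <- utf8ToInt(enc2utf8(word))
sprintf("U+%04X", code_points)

# Number of characters (code points) in the word.
n_chars <- nchar(word, type = "chars")

# The new column holds exactly the same strings as the row names.
same_strings <- identical(unigramsList$word1, rownames(unigramsList))
```

If the code points match the original word and same_strings is TRUE, the column itself is undamaged and only its printed form differs.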

My session information is as below:

> sessionInfo()
R version 3.2.5 (2016-04-14)
Platform: i386-w64-mingw32/i386 (32-bit)
Running under: Windows 7 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] RWeka_0.4-29  tm_0.6-2      NLP_0.1-9     stringi_1.0-1 stringr_1.0.0

loaded via a namespace (and not attached):
[1] magrittr_1.5      parallel_3.2.5    tools_3.2.5       slam_0.1-37      
[5] grid_3.2.5        rJava_0.9-8       RWekajars_3.9.0-1

Upvotes: 2

Views: 324

Answers (1)

Periasamy Ramamoorthy

Reputation: 31

I managed to hack together a solution to the above, so I am posting it for anyone interested in this topic.

a) Instead of saving the n-grams as CSV files on Windows, I saved them in R's binary format (using the save() and load() functions). I had previously stored the generated n-grams as CSV and read them back with read.csv() with the fileEncoding option set to UTF-8, but that did not help, even after deploying on Shiny.

b) I deployed and tested the app on shinyapps.io, which runs on a Linux platform and was therefore able to display the Tamil characters correctly. Testing locally on Windows was not effective, because the characters were displayed as raw Unicode escapes such as the <U+0B85> sequences shown in the question.
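A minimal sketch of the round trip in (a), assuming a small hypothetical n-gram table (the names ngrams, rdata_file and restored are illustrative only). R's binary format stores each string together with its encoding mark, so the words should come back unchanged without passing through the Windows native encoding. The Tamil words are written with \u escapes to keep the script ASCII-only.

```r
# Hypothetical n-gram table standing in for the real data.
ngrams <- data.frame(
  word1 = c("\u0B85\u0BB5\u0BA9\u0BCD",
            "\u0B87\u0BA9\u0BCD\u0BA9\u0BC1\u0BAE\u0BCD"),
  count = c(2L, 1L),
  stringsAsFactors = FALSE
)

# Save in R's binary format instead of CSV (temporary file for illustration).
rdata_file <- tempfile(fileext = ".RData")
save(ngrams, file = rdata_file)

# Load into a separate environment so the original object is not overwritten.
restored_env <- new.env()
load(rdata_file, envir = restored_env)
restored <- restored_env$ngrams

# The restored strings should match the originals, code point for code point.
words_match <- identical(restored$word1, ngrams$word1)
first_word_points <- utf8ToInt(enc2utf8(restored$word1[1]))

unlink(rdata_file)  # remove the temporary file
```

saveRDS() and readRDS() would serve the same purpose for a single object. Either way, this only protects the stored data. How the words are printed still depends on the platform, which is what (b) addresses.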

Thanks to Marek Gagolewski, author of stringi, for his suggestions regarding shinyapps.io, which helped me deploy and test on Shiny's Linux platform.

If you are interested, you can check out the product at the link below: https://periasamyr.shinyapps.io/predictwordml/

Regards

Peri

Upvotes: 1
