Reputation: 51
When I run the app, I get the following error:
Error in FUN: invalid input 'at my monthly blog stats and we’re nearly on 4000 for April which is amazing – thank you Jx 😘😘' in 'utf8towcs'
I tried to convert the data as shown below, because the blogs.txt file contains emoticons and similar characters.
blogs<-iconv(blogs, "latin1", "ASCII", sub="")
news<-iconv(news, "latin1", "ASCII", sub="")
twitter<-iconv(twitter, "latin1", "ASCII", sub="")
I also tried using the iconv function inside tm_map, as below:
corpus <- VCorpus(VectorSource(data.sample))
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
corpus <- tm_map(corpus, toSpace, "(f|ht)tp(s?)://(.*)[.][a-z]+")
tospace <- tm_map(corpus,
                  content_transformer(function(x)
                    iconv(x, to="UTF-8", sub="byte")),
                  mc.cores=1)
I still get the same error.
Any help would be appreciated.
Session info:
====================
R version 3.4.2 (2017-09-28)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] stringr_1.2.0 shiny_1.0.5 slam_0.1-40 ggplot2_2.2.1 RWeka_0.4-35 tm_0.7-1 NLP_0.1-11
[8] stringi_1.1.5
loaded via a namespace (and not attached):
[1] Rcpp_0.12.13 magrittr_1.5 RWekajars_3.9.1-4 munsell_0.4.3 colorspace_1.3-2
[6] xtable_1.8-2 R6_2.2.2 rlang_0.1.4 plyr_1.8.4 tools_3.4.2
[11] parallel_3.4.2 grid_3.4.2 gtable_0.2.0 htmltools_0.3.6 yaml_2.1.14
[16] lazyeval_0.2.1 digest_0.6.12 tibble_1.3.4 rJava_0.9-9 rsconnect_0.8.5
[21] mime_0.5 compiler_3.4.2 scales_0.5.0 jsonlite_1.5 httpuv_1.3.5
Upvotes: 0
Views: 4997
Reputation: 41
Try transliterating the text to ASCII with the 'stringi' package, then rebuild the corpus from the result. The rebuild is necessary because stri_trans_general returns a plain character vector, not a corpus, so apply it to the raw text rather than to the corpus object.
library(stringi)
# data.sample is the character vector from the question
text <- stri_trans_general(data.sample, "latin-ascii")
corpus <- Corpus(VectorSource(text))
Upvotes: 0
Reputation: 5003
Your problem clearly has to do with the data not being encoded in UTF-8.
There are several ways to ensure this:
- iconv with `to = "UTF-8"`
- enc2utf8()
If you are running your app on a Windows PC during development, you might have to tell R that the encoding is UTF-8
with
Encoding(blogs) <- "UTF-8"
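As a rough sketch of how these pieces could fit together using only base R: the sample strings and variable names below are made up for illustration (in the question they would be the lines read from blogs.txt), and the final ASCII step is optional and only an assumption about what the app needs.

```r
# Hypothetical sample lines, built from raw bytes so the example does not
# depend on the encoding of this script file.
emoji_bytes  <- rawToChar(as.raw(c(0xf0, 0x9f, 0x98, 0x98)))  # an emoji as UTF-8 bytes
latin1_bytes <- rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xe9)))  # "caf" + e-acute as one latin1 byte
raw_lines <- c(paste0("thank you Jx ", emoji_bytes),
               paste0(latin1_bytes, " au lait"))

# validUTF8() (base R >= 3.3.0) checks the bytes of each element.
is_utf8 <- validUTF8(raw_lines)

# Declare the encoding per element: valid UTF-8 stays UTF-8,
# everything else is assumed (!) to be latin1.
Encoding(raw_lines) <- ifelse(is_utf8, "UTF-8", "latin1")

# enc2utf8() converts each element to UTF-8 based on its declared encoding.
utf8_lines <- enc2utf8(raw_lines)

# Optional: drop everything that has no ASCII equivalent (emoji, accents).
ascii_lines <- iconv(utf8_lines, from = "UTF-8", to = "ASCII", sub = "")
```

Checking `validUTF8(utf8_lines)` before building the corpus is a cheap way to confirm that no invalid bytes are left for the later tm_map steps to trip over.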
Upvotes: 1