Jay

Reputation: 51

Invalid input error in utf8towcs when running R code in shiny

When I run the app, I get the following error.

Error in FUN: invalid input 'at my monthly blog stats and we’re nearly on 4000 for April which is amazing – thank you Jx 😘😘' in 'utf8towcs'

I tried to convert the data as below because of emoticons etc. in the blogs.txt file.

blogs<-iconv(blogs, "latin1", "ASCII", sub="")

news<-iconv(news, "latin1", "ASCII", sub="")

twitter<-iconv(twitter, "latin1", "ASCII", sub="")

and also using the iconv function as below,

# Create corpus and clean the data

corpus <- VCorpus(VectorSource(data.sample))

toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))

corpus <- tm_map(corpus, toSpace, "(f|ht)tp(s?)://(.*)[.][a-z]+")

tospace <- tm_map(corpus,
                  content_transformer(function(x)
                    iconv(x, to="UTF-8", sub="byte")),
                  mc.cores=1)

I am still getting the same error.

Please help in this regard.

Session info:

====================

R version 3.4.2 (2017-09-28)

Platform: x86_64-w64-mingw32/x64 (64-bit)

Running under: Windows 7 x64 (build 7601) Service Pack 1

Matrix products: default

locale:

[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252

[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C

[5] LC_TIME=English_United States.1252

attached base packages:

[1] stats graphics grDevices utils datasets methods base

other attached packages:

[1] stringr_1.2.0 shiny_1.0.5 slam_0.1-40 ggplot2_2.2.1 RWeka_0.4-35 tm_0.7-1 NLP_0.1-11

[8] stringi_1.1.5

loaded via a namespace (and not attached):

[1] Rcpp_0.12.13 magrittr_1.5 RWekajars_3.9.1-4 munsell_0.4.3 colorspace_1.3-2

[6] xtable_1.8-2 R6_2.2.2 rlang_0.1.4 plyr_1.8.4 tools_3.4.2

[11] parallel_3.4.2 grid_3.4.2 gtable_0.2.0 htmltools_0.3.6 yaml_2.1.14

[16] lazyeval_0.2.1 digest_0.6.12 tibble_1.3.4 rJava_0.9-9 rsconnect_0.8.5

[21] mime_0.5 compiler_3.4.2 scales_0.5.0 jsonlite_1.5 httpuv_1.3.5

Upvotes: 0

Views: 4997

Answers (2)

Marcelo Tibau

Reputation: 41

Try transliterating the text with the 'stringi' package, then convert the result back into a corpus. The second step is necessary because the stri_trans_general function returns a character vector rather than a corpus.

library(stringi)
corpus <- stri_trans_general(corpus, "latin-ascii")
corpus <- Corpus(VectorSource(corpus))

Upvotes: 0

Bertil Baron

Reputation: 5003

Your problem clearly has to do with the data not being encoded in UTF-8.

There are several ways to ensure that it is:

  • make sure the original file is encoded in UTF-8; for a simple text file this can be done with Notepad++, for example
  • use iconv() with `to = "UTF-8"`
  • use enc2utf8()
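
As a rough sketch of the iconv() option, assuming the text is mostly UTF-8 already and only contains a few stray invalid bytes: the sample strings below are made up for illustration (they are not from blogs.txt), and validUTF8() needs R 3.3.0 or later. Valid characters such as emoji are kept; only bytes that cannot be converted are dropped.

```r
# Hypothetical sample data (not from the question): one valid UTF-8 string
# ending in an emoji, and one string containing a stray Latin-1 byte (0xE9).
good <- rawToChar(as.raw(c(0x4A, 0x78, 0x20, 0xF0, 0x9F, 0x98, 0x98)))
bad  <- rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xE9, 0x20, 0x6F, 0x6B)))
blogs <- c(good, bad)

# validUTF8() reports which elements are valid UTF-8 byte sequences.
valid <- validUTF8(blogs)

# Re-encode UTF-8 to UTF-8; sub = "" drops bytes that cannot be converted.
blogs_clean <- iconv(blogs, from = "UTF-8", to = "UTF-8", sub = "")

# Check the result: every cleaned element should now be valid UTF-8.
all_valid <- all(validUTF8(blogs_clean))
```

If the file is really in another encoding (for example Windows-1252), `from` would need to name that encoding instead; which one applies to your data is something to check rather than assume.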

If you are running your app on a Windows PC during development, you might have to tell R that the encoding is UTF-8 with

Encoding(blogs) <- "UTF-8"
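
The same declaration can also be made when the file is read. This is only a sketch: the temporary file stands in for blogs.txt and its contents are invented. It assumes the file on disk really is UTF-8, since neither readLines(encoding = ) nor Encoding<- converts any bytes; they only mark the strings.

```r
# Hypothetical stand-in for blogs.txt: write two UTF-8 lines to a temp file.
path <- tempfile(fileext = ".txt")
emoji <- rawToChar(as.raw(c(0xF0, 0x9F, 0x98, 0x98)))
writeLines(c("plain line", paste("thank you Jx", emoji)), path, useBytes = TRUE)

# Option 1: mark the strings as UTF-8 while reading.
blogs <- readLines(path, encoding = "UTF-8")

# Option 2: read first, then declare the encoding afterwards.
blogs2 <- readLines(path)
Encoding(blogs2) <- "UTF-8"

# Pure ASCII strings stay "unknown"; only non-ASCII strings get the mark.
marks <- Encoding(blogs)

unlink(path)
```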

Upvotes: 1
