m45ha

Reputation: 405

custom dictionaries in quanteda

I need to run a LIWC (Linguistic Inquiry and Word Count) analysis and I am using quanteda/quanteda.dictionaries. I need to "load" custom dictionaries: I saved my word lists as individual .txt files and load them through readLines (example with just one dictionary):

autonomy = readLines("Dictionary/autonomy.txt", encoding = "UTF-8")

EODic<-quanteda::dictionary(list(autonomy=autonomy),encoding = "auto")

This is the text I am trying it on:

txt <- c("12th Battalion Productions is producing a fully holographic feature length production. Presenting a 3D audio-visual projection without a single cast member present, to give the illusion of live stage performance.")

Then I run it:

liwcalike(txt, EODic, what = "word")

and get this error:

Error in stri_replace_all_charclass(value, "\\p{Z}", concatenator) : 
  invalid UTF-8 byte sequence detected; perhaps you should try calling stri_enc_toutf8()

Obviously, the problem is with my .txt file. I have quite a few dictionaries and would rather load them from files.

How can I fix this error? Specifying the encoding in readLines does not seem to help.

Here is the file https://drive.google.com/file/d/12plgfJdMawmqTkcLWxD1BfWdaeHuPTXV/view?usp=sharing

Update: the easiest way to solve this on Mac was to open the .txt file in Word rather than TextEdit. Word gives options for encoding unlike default TextEdit!

Upvotes: 1

Views: 668

Answers (1)

Ken Benoit

Reputation: 14902

OK, the problem is not an encoding one, since everything in the file you linked can be encoded entirely in 7-bit (lower-128) ASCII. The problem is the empty strings produced by blank lines in the file. Many entries also have leading spaces that need removing. Both are easy to fix with some subsetting and a stringi cleanup operation.
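
As a rough way to spot these issues in a word list before building a dictionary, here is a minimal sketch using only base R. The `words` vector is made-up sample data that imitates the file's problems, not the actual file contents, and `validUTF8()` assumes R 3.3.0 or later.

```r
# Hypothetical sample imitating the issues in the word list file
words <- c("adventuresome", " adventurous", "", "dangerous", " dare")

# Positions of empty entries (produced by blank lines in the file)
empty_idx <- which(words == "")

# Positions of entries with leading or trailing whitespace
padded_idx <- which(words != trimws(words))

# Positions of entries that are not valid UTF-8 (none expected here)
bad_utf8_idx <- which(!validUTF8(words))

print(empty_idx)
print(padded_idx)
print(bad_utf8_idx)
```

If the last check returns any positions for a real file, the encoding would be worth a second look; otherwise whitespace and empty entries are the likelier culprits.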

library("quanteda")
## Package version: 1.3.14

autonomy <- readLines("~/Downloads/risktaking.txt", encoding = "UTF-8")
head(autonomy, 15)
##  [1] "adventuresome"  " adventurous"   " audacious"     " bet"          
##  [5] " bold"          " bold-spirited" " brash"         " brave"        
##  [9] " chance"        " chancy"        " courageous"    " danger"       
## [13] ""               "dangerous"      " dare"

# strip leading or trailing whitespace
autonomy <- stringi::stri_trim_both(autonomy)
# get rid of empties
autonomy <- autonomy[autonomy != ""]

Now you can create the dictionary and apply the quanteda.dictionaries::liwcalike() function.

# now define the quanteda dictionary
EODic <- dictionary(list(autonomy = autonomy))

txt <- c("12th Battalion Productions is producing a fully holographic feature length production. Presenting a 3D audio-visual projection without a single cast member present, to give the illusion of live stage performance.")

library("quanteda.dictionaries")
liwcalike(txt, dictionary = EODic)
##   docname Segment WC  WPS Sixltr Dic autonomy AllPunc Period Comma Colon
## 1   text1       1 35 15.5  34.29   0        0   11.43   5.71  2.86     0
##   SemiC QMark Exclam Dash Quote Apostro Parenth OtherP
## 1     0     0      0 2.86     0       0       0   8.57
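
Since the question mentions having quite a few dictionaries stored as files, the same cleanup could be wrapped in a small helper. This is only a sketch: `read_word_lists()` is a hypothetical name, it assumes one .txt file per dictionary key in a single folder, and it uses base R `trimws()` in place of `stringi::stri_trim_both()`.

```r
# Hypothetical helper: read every .txt file in a folder into a named list
# of cleaned word vectors, one list element per file.
read_word_lists <- function(dir) {
  files <- list.files(dir, pattern = "\\.txt$", full.names = TRUE)
  lists <- lapply(files, function(f) {
    # read the file, strip leading/trailing whitespace, drop empty entries
    w <- trimws(readLines(f, encoding = "UTF-8", warn = FALSE))
    w[w != ""]
  })
  # use the file name without its extension as the dictionary key
  names(lists) <- tools::file_path_sans_ext(basename(files))
  lists
}

# Usage, assuming a folder "Dictionary" holding autonomy.txt and similar files:
# word_lists <- read_word_lists("Dictionary")
# EODic <- quanteda::dictionary(word_lists)
```

The resulting named list has the same shape as `list(autonomy = autonomy)` above, so it can be passed to `dictionary()` in the same way.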

Upvotes: 2
