Nick Olczak

Reputation: 315

Can you add custom tokens to the tokenizer (Chinese language) in quanteda?

Does anybody know if it is possible to add custom tokens after texts have been tokenized in quanteda?

I am trying to do some analysis of Chinese-language texts, but the tokenizer doesn't recognise the abbreviation for ASEAN, "东盟", as a single word (see the example below).

Alternatively, are there any other tokenizers for Chinese-language texts that work with quanteda? I had been using the spacyr package, but cannot get it working at the moment.

I had made some functions that use a feature co-occurrence matrix (fcm) to count the number of times other words appear within a 5-word window of a particular term, and then produce a table of these results (see below). However, this doesn't seem to work for the term "东盟".


## Function 1

library("quanteda")

get_fcm <- function(data) {
  ch_stop <- stopwords("zh", source = "misc")   # Chinese stopword list
  corp <- corpus(data)
  toks <- tokens(corp, remove_punct = TRUE) %>%
    tokens_remove(ch_stop)
  fcm(toks, context = "window")                 # co-occurrences within a 5-token window (the default)
}

## Function 2

convert2df <- function(mat, term) {   # "mat" rather than "matrix", to avoid masking base::matrix
  mat_term <- mat[term, ]             # co-occurrence counts for the target term
  df <- convert(t(mat_term), to = "data.frame")
  colnames(df)[1] <- "CoTerm"
  colnames(df)[2] <- "Freq"
  df[order(-df$Freq), ]               # most frequent co-occurring terms first
}
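
For reference, this is how I chain the two functions (the two sample sentences here are just stand-ins for my actual corpus):

texts <- c(d1 = "越南是东盟成员国",      # "Vietnam is an ASEAN member state"
           d2 = "东盟峰会在曼谷举行")    # "The ASEAN summit was held in Bangkok"
fcmat <- get_fcm(texts)
convert2df(fcmat, "东盟")   # fails: "东盟" is tokenized as "东" + "盟", so the row lookup errors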

Would adding %>% tokens_compound(phrase("东 盟"), concatenator = "") to the toks <- line of Function 1 resolve this?
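
That is, a sketch of the modified pipeline (assuming tokens_compound() can simply be chained after tokens_remove()):

toks <- tokens(corp, remove_punct = TRUE) %>%
  tokens_remove(ch_stop) %>%
  tokens_compound(phrase("东 盟"), concatenator = "")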

Upvotes: 1

Views: 290

Answers (1)

Ken Benoit

Reputation: 14902

You can post-process the tokens after tokenising to rejoin split phrases such as "东盟", if you have a specific list of them.

> tokens("东盟") %>%
+     tokens_compound(phrase("东 盟"), concatenator = "")
Tokens consisting of 1 document.
text1 :
[1] "东盟"

Upvotes: 1
