Reputation: 11
I am trying to vectorize my text data using R's tm package.
Right now my data corpus is in the following form:
1. The sports team practiced today
2. The soccer team went took the day off
With single-word tokens, the data would get vectorized into:
<the, sports, team, practiced, today, soccer, went, took, day, off>
1. <1, 1, 1, 1, 1, 0, 0, 0, 0, 0>
2. <1, 0, 1, 0, 0, 1, 1, 1, 1, 1>
I would prefer to use a group of custom phrases for my vector, such as:
<sports team, soccer team, practiced today, day off>
1. <1, 0, 1, 0>
2. <0, 1, 0, 1>
Is there a package or function in R that will do this? Or are there any other open-source resources that have similar functionality? Thank you.
Upvotes: 1
Views: 1019
Reputation: 14902
You asked about other text packages, so I invite you to try quanteda, which I developed with Paul Nulty.
In the code below, you first define the multi-word phrases you want as a named list, typed as a quanteda "dictionary" class object using the dictionary() constructor. You then use phrasetotoken() to convert the phrases found in your texts into single "tokens" consisting of the phrase's words joined by underscores. The tokeniser ignores underscores, so your phrases are then treated as if they were single-word tokens.
dfm() is the constructor for document-feature matrices. It can take a regular expression that defines the features to keep, here any token containing an underscore character (the regex could of course be refined, but I've kept it deliberately simple here). dfm() has a lot of options; see ?dfm.
install.packages("quanteda")
library(quanteda)
mytext <- c("The sports team practiced today",
"The soccer team went took the day off")
myphrases <- dictionary(list(myphrases=c("sports team", "soccer team", "practiced today", "day off")))
mytext2 <- phrasetotoken(mytext, myphrases)
mytext2
## [1] "The sports_team practiced_today" "The soccer_team went took the day_off"
# keptFeatures is a regular expression: keep only phrases
mydfm <- dfm(mytext2, keptFeatures = "_", verbose=FALSE)
mydfm
## Document-feature matrix of: 2 documents, 4 features.
## 2 x 4 sparse Matrix of class "dfmSparse"
## features
## docs day_off practiced_today soccer_team sports_team
## text1 0 1 0 1
## text2 1 0 1 0
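To make the underscore idea concrete without any package, here is a minimal base R sketch of the same steps. It is not how quanteda implements phrasetotoken() or dfm(); the helper name compound() and the variables joined, tokens, features and counts are made up for illustration, and the matching is assumed to be a plain case-sensitive fixed-string match.

```r
# Illustrative base R sketch only; compound() is a hypothetical helper,
# not part of quanteda or tm.
mytext <- c("The sports team practiced today",
            "The soccer team went took the day off")
phrases <- c("sports team", "soccer team", "practiced today", "day off")

# Step 1: wherever a phrase occurs, join its words with underscores.
# Assumption: exact, case-sensitive matching of the phrase as written.
compound <- function(texts, phrases) {
  for (p in phrases) {
    texts <- gsub(p, gsub(" ", "_", p, fixed = TRUE), texts, fixed = TRUE)
  }
  texts
}
joined <- compound(mytext, phrases)

# Step 2: split on whitespace and keep only tokens containing an underscore.
tokens <- lapply(strsplit(joined, "\\s+"),
                 function(tok) tok[grepl("_", tok, fixed = TRUE)])

# Step 3: count each phrase token per document (rows = documents).
features <- gsub(" ", "_", phrases, fixed = TRUE)
counts <- t(sapply(tokens, function(tok) {
  tabulate(match(tok, features), nbins = length(features))
}))
dimnames(counts) <- list(paste0("text", seq_along(mytext)), features)
counts
```

The columns of counts follow the order of phrases, so each row corresponds to one of the phrase vectors asked for in the question. A real tokeniser would also need to handle punctuation and case, which this sketch skips.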
Happy to help with any quanteda-related questions, including feature requests if you can suggest improvements on phrase handling.
Upvotes: 2
Reputation: 7664
How about something like this?
library(tm)
library(stringr)
text <- c("The sports team practiced today", "The soccer team went took the day off")
# VCorpus rather than Corpus: recent tm versions may build a SimpleCorpus here,
# which ignores custom tokenizers
corpus <- VCorpus(VectorSource(text))
tokenizing.phrases <- c("sports team", "soccer team", "practiced today", "day off")
phraseTokenizer <- function(x) {
  x <- as.character(x) # extract the plain text from the tm TextDocument object
  x <- str_trim(x)
  if (is.na(x)) return("")
  # regex(..., ignore_case = TRUE) replaces the deprecated ignore.case()
  phrase.hits <- str_detect(x, regex(tokenizing.phrases, ignore_case = TRUE))
  if (any(phrase.hits)) {
    # only split once on the first hit, so we don't have to worry about multiple occurrences of the same phrase
    split.phrase <- tokenizing.phrases[which(phrase.hits)[1]]
    # warning(paste("split phrase:", split.phrase))
    temp <- unlist(str_split(x, regex(split.phrase, ignore_case = TRUE), 2))
    # recursive: tokenize the text on either side of the matched phrase
    out <- c(phraseTokenizer(temp[1]), split.phrase, phraseTokenizer(temp[2]))
  } else {
    out <- MC_tokenizer(x)
  }
  # get rid of any extraneous empty strings, which can happen if a phrase occurs just before a punctuation mark
  out[out != ""]
}
tdm <- TermDocumentMatrix(corpus, control = list(tokenize = phraseTokenizer))
> Terms(tdm)
[1] "day off" "practiced today" "soccer team" "sports team" "the" "took"
[7] "went"
Upvotes: 0