Reputation: 35
How can I turn the output of kwic into a corpus for further analysis? More specifically, I want to create a corpus based on the words coming before and after a keyword (contextPre, contextPost) to do further sentiment analysis on them.
Upvotes: 1
Views: 925
Reputation: 14902
Simplest way: create a pre-context and a post-context corpus, with a document variable (docvar) identifying the context, and then merge the two corpora with the + operator.
library(quanteda)
mykwic <- kwic(data_corpus_inaugural, "terror")
# make a corpus with the pre-word context
mycorpus <- corpus(mykwic$pre)
docvars(mycorpus, "context") <- "pre"
# make a corpus with the post-word context
mycorpus2 <- corpus(mykwic$post)
docvars(mycorpus2, "context") <- "post"
# combine the two corpora
mycorpus <- mycorpus + mycorpus2
summary(mycorpus)
# Corpus consisting of 16 documents.
#
#   Text Types Tokens Sentences context
#  text1     5      5         1     pre
#  text2     4      5         1     pre
#  text3     5      5         1     pre
#  text4     5      5         1     pre
#  text5     5      5         1     pre
#  text6     5      5         1     pre
#  text7     5      5         1     pre
#  text8     5      5         1     pre
# text11     4      5         1    post
# text21     5      5         1    post
# text31     5      5         1    post
# text41     5      5         1    post
# text51     5      5         1    post
# text61     5      5         2    post
# text71     5      5         2    post
# text81     5      5         1    post
#
# Source: Combination of corpuses mycorpus and mycorpus2
# Created: Wed May 25 23:35:54 2016
# Notes:
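On more recent quanteda releases the same idea may need small adjustments. The sketch below assumes quanteda 3.x or later, where kwic() is applied to a tokens object and where, as far as I know, combining corpora that share document names raises an error. The names kw, ids, ctx_pre, ctx_post and the ".pre"/".post" suffixes are illustrative choices, not part of the original answer.

```r
library(quanteda)

# Assumption: quanteda >= 3.0, where kwic() works on a tokens object
toks <- tokens(data_corpus_inaugural)
kw <- kwic(toks, pattern = "terror", window = 5)

# Build unique document names from the source document and match position,
# so that the pre and post corpora never share a docname
ids <- paste0(kw$docname, "_", kw$from)

# Corpus of the words before each keyword match
ctx_pre <- corpus(kw$pre, docnames = paste0(ids, ".pre"))
docvars(ctx_pre, "context") <- "pre"

# Corpus of the words after each keyword match
ctx_post <- corpus(kw$post, docnames = paste0(ids, ".post"))
docvars(ctx_post, "context") <- "post"

# Combine the two corpora, as in the answer above
ctx <- ctx_pre + ctx_post
table(docvars(ctx, "context"))
```

Encoding the source document and token position in the docname also makes it easy to trace each context back to the match it came from.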
Added:
As of v0.9.7-6, quanteda has a method to construct a corpus directly from a kwic object. So this now works:
mykwic <- kwic(data_corpus_inaugural, "southern")
summary(corpus(mykwic))
# Corpus consisting of 28 documents.
#
#        Text Types Tokens Sentences         docname position  keyword context
#   text1.pre     5      5         1      1797-Adams     1807 southern     pre
#   text2.pre     4      5         1      1825-Adams     2434 southern     pre
#   text3.pre     4      5         1    1861-Lincoln       98 Southern     pre
#   text4.pre     5      5         1    1865-Lincoln      283 southern     pre
#   text5.pre     5      5         1      1877-Hayes      378 Southern     pre
#   text6.pre     5      5         1      1877-Hayes      956 Southern     pre
#   text7.pre     5      5         1      1877-Hayes     1250 Southern     pre
#   text8.pre     5      5         1   1881-Garfield     1007 Southern     pre
#   text9.pre     4      5         1       1909-Taft     4029 Southern     pre
#  text10.pre     5      5         1       1909-Taft     4230 Southern     pre
#  text11.pre     5      5         1       1909-Taft     4350 Southern     pre
#  text12.pre     5      5         1       1909-Taft     4537 Southern     pre
#  text13.pre     5      5         1       1909-Taft     4597 Southern     pre
#  text14.pre     5      5         1 1953-Eisenhower     1226 southern     pre
#  text1.post     5      5         1      1797-Adams     1807 southern    post
#  text2.post     5      5         1      1825-Adams     2434 southern    post
#  text3.post     5      5         1    1861-Lincoln       98 Southern    post
#  text4.post     5      5         2    1865-Lincoln      283 southern    post
#  text5.post     5      5         2      1877-Hayes      378 Southern    post
#  text6.post     5      5         1      1877-Hayes      956 Southern    post
#  text7.post     5      5         1      1877-Hayes     1250 Southern    post
#  text8.post     5      5         2   1881-Garfield     1007 Southern    post
#  text9.post     5      5         2       1909-Taft     4029 Southern    post
# text10.post     5      5         1       1909-Taft     4230 Southern    post
# text11.post     5      5         1       1909-Taft     4350 Southern    post
# text12.post     5      5         1       1909-Taft     4537 Southern    post
# text13.post     5      5         1       1909-Taft     4597 Southern    post
# text14.post     5      5         1 1953-Eisenhower     1226 southern    post
#
# Source: Corpus created from kwic(x, keywords = "southern")
# Created: Thu May 26 09:47:19 2016
# Notes:
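For the sentiment step mentioned in the question, one possible continuation is to count dictionary hits in each context and compare the two sides of the keyword. This is only a rough sketch: it assumes quanteda 3.x or later, builds the context corpus by hand from the pre and post columns, and uses a tiny made-up lexicon (toy_dict) that should be replaced by a real sentiment dictionary. All object names are hypothetical.

```r
library(quanteda)

# Assumption: quanteda >= 3.0, where kwic() works on a tokens object
toks <- tokens(data_corpus_inaugural, remove_punct = TRUE)
kw <- kwic(toks, pattern = "southern", window = 5)

# One document per context window, labelled by its side of the keyword
ctx <- corpus(
  c(kw$pre, kw$post),
  docnames = paste0("ctx", seq_len(2 * nrow(kw))),
  docvars = data.frame(context = rep(c("pre", "post"), each = nrow(kw)))
)

# Hypothetical toy lexicon, for illustration only
toy_dict <- dictionary(list(
  positive = c("peace", "hope", "prosper*", "free*"),
  negative = c("war", "fear", "hostil*", "rebel*")
))

# Count dictionary matches per context document
ctx_dfm <- dfm(tokens(ctx))
sent <- dfm_lookup(ctx_dfm, dictionary = toy_dict)

# Sum the counts over all pre contexts and all post contexts
sent_by_side <- dfm_group(sent, groups = docvars(sent, "context"))
print(sent_by_side)
```

With windows of only five words the counts will be sparse, so a wider window argument in kwic() may be worth trying before drawing conclusions.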
Upvotes: 2