Reputation: 940
I have a quanteda corpus of 10 documents, several of which are by the same author. I store the author in a separate docvar column: myCorpus$documents[,"author"]
> docvars(myCorpus)
author
206035 author1
269823 author2
304225 author1
422364 author2
<...snip..>
I'm charting a lexical dispersion plot with textplot_xray:
textplot_xray(
    kwic(myCorpus, "image"),
    kwic(myCorpus, "one"),
    kwic(myCorpus, "like"),
    kwic(myCorpus, "time"),
    kwic(myCorpus, "just"),
    scale = "absolute"
)
How can I use myCorpus$documents[,"author"] as the document identifier instead of the Document ID?
I'm not trying to group the docs; I just want to identify each document by its author. I recognize that Doc IDs need to be unique, so I can't simply rename the docs with docnames(myCorpus) <-
Upvotes: 1
Views: 303
Reputation: 14902
The textplot document names are taken from the docnames of the corpus. In this case, you want to create new documents grouped by the author docvar. This can be accomplished using the texts() extractor function and its groups argument.
To create a reproducible example, I will use the built-in data object data_char_sampletext, segment it into sentences to form the new documents, and then simulate the author docvar.
library("quanteda")
# quanteda version 1.0.0
myCorpus <- corpus(data_char_sampletext) %>%
    corpus_reshape(to = "sentences")
# make some duplicated author docvar values
set.seed(1)
docvars(myCorpus, "author") <-
    sample(c("author1", "author2", "author3"),
           size = ndoc(myCorpus), replace = TRUE)
This produces:
summary(myCorpus)
# Corpus consisting of 15 documents:
#
# Text Types Tokens Sentences author
# text1.1 23 23 1 author1
# text1.2 40 53 1 author2
# text1.3 48 63 1 author2
# text1.4 30 39 1 author3
# text1.5 20 25 1 author1
# text1.6 43 57 1 author3
# text1.7 13 15 1 author3
# text1.8 25 26 1 author2
# text1.9 9 9 1 author2
# text1.10 37 53 1 author1
# text1.11 32 41 1 author1
# text1.12 30 30 1 author1
# text1.13 28 35 1 author3
# text1.14 16 18 1 author2
# text1.15 32 42 1 author3
#
# Source: /Users/kbenoit/tmp/* on x86_64 by kbenoit
# Created: Fri Feb 16 18:03:13 2018
# Notes: corpus_reshape.corpus(., to = "sentences")
Now, we extract the texts as a character vector, grouping these by the author document variable. This produces a named character vector of length 3, where the names are the (unique) author identifiers.
groupedtexts <- texts(myCorpus, groups = "author")
length(groupedtexts)
# [1] 3
names(groupedtexts)
# [1] "author1" "author2" "author3"
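To make the grouping step more concrete, here is a minimal base R sketch of what grouping texts by a document variable amounts to: the texts that share a value are concatenated into one string. The vectors txt and authors are hypothetical stand-ins for the corpus texts and the author docvar, and this only illustrates the idea; it is not quanteda's actual implementation.

```r
# Hypothetical stand-ins for the corpus texts and the author docvar
txt <- c("First text.", "Second text.", "Third text.", "Fourth text.")
authors <- c("author1", "author2", "author1", "author3")

# split() collects the texts per author; paste() joins each group into one string
grouped <- vapply(split(txt, authors), paste, character(1), collapse = " ")

length(grouped)  # one element per unique author
names(grouped)   # the unique author values become the names
```

Because each unique author ends up with exactly one element, the names are guaranteed to be unique, which is what lets them serve as document names.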
Then (as an illustration):
textplot_xray(
    kwic(groupedtexts, "and"),
    kwic(groupedtexts, "for")
)
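If the documents should stay separate rather than be merged per author, one possible alternative is to build unique labels from the author values and use those as document names. The sketch below assumes that a numeric suffix on repeated authors is acceptable. It uses base R's make.unique() on a hypothetical authors vector standing in for docvars(myCorpus, "author"). The final assignment to docnames() is shown as a comment because it has not been tested against this corpus.

```r
# Hypothetical author vector standing in for docvars(myCorpus, "author")
authors <- c("author1", "author2", "author1", "author2", "author3")

# make.unique() appends .1, .2, ... to repeated values so every label is distinct
labels <- make.unique(authors)
labels
# [1] "author1"   "author2"   "author1.1" "author2.1" "author3"

# Assumption: with quanteda loaded, the labels could replace the document names:
# docnames(myCorpus) <- make.unique(as.character(docvars(myCorpus, "author")))
```

This keeps one panel per original document in the plot, with each panel labelled by its author plus a suffix where needed.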
Upvotes: 1