Colin

Reputation: 940

Quanteda textplot_xray grouped by non-unique docvar as document

I have a quanteda corpus of 10 documents, several of which are by the same author. I store the author in a separate docvar column: myCorpus$documents[,"author"]

> docvars(myCorpus)

          author   
206035    author1   
269823    author2   
304225    author1   
422364    author2
<...snip..>

I'm charting a lexical dispersion plot with textplot_xray:

textplot_xray(
            kwic(myCorpus, "image"),
            kwic(myCorpus, "one"),
            kwic(myCorpus, "like"),
            kwic(myCorpus, "time"),
            kwic(myCorpus, "just"),
            scale = "absolute"
          )

[image: textplot_xray output]

How can I use myCorpus$documents[,"author"] as the document identifier instead of the Document ID?

I'm not trying to group the docs; I just want to identify each document by its author. I recognize that doc IDs need to be unique, so I can't simply rename the docs with docnames(myCorpus) <-

Upvotes: 1

Views: 303

Answers (1)

Ken Benoit

Reputation: 14902

The textplot document names are taken from the docnames of the corpus. In this case, you want to create new documents grouped by the author docvar. This can be accomplished using the texts() extractor function and its groups argument.

To create a reproducible example, I will use the built-in data object data_char_sampletext, segment it into sentences to form the new documents, and then simulate the author docvar.

library("quanteda")
# quanteda version 1.0.0

myCorpus <- corpus(data_char_sampletext) %>% 
    corpus_reshape(to = "sentences")
# make some duplicated author docvar values
set.seed(1)
docvars(myCorpus, "author") <- 
    sample(c("author1", "author2", "author3"), 
           size = ndoc(myCorpus), replace = TRUE)

This produces:

summary(myCorpus)
# Corpus consisting of 15 documents:
#     
#     Text Types Tokens Sentences  author
#  text1.1    23     23         1 author1
#  text1.2    40     53         1 author2
#  text1.3    48     63         1 author2
#  text1.4    30     39         1 author3
#  text1.5    20     25         1 author1
#  text1.6    43     57         1 author3
#  text1.7    13     15         1 author3
#  text1.8    25     26         1 author2
#  text1.9     9      9         1 author2
# text1.10    37     53         1 author1
# text1.11    32     41         1 author1
# text1.12    30     30         1 author1
# text1.13    28     35         1 author3
# text1.14    16     18         1 author2
# text1.15    32     42         1 author3
# 
# Source:  /Users/kbenoit/tmp/* on x86_64 by kbenoit
# Created: Fri Feb 16 18:03:13 2018
# Notes:   corpus_reshape.corpus(., to = "sentences") 

Now, we extract the texts as a character vector, grouping these by the author document variable. This produces a named character vector of length 3, where the names are the (unique) author identifiers.

groupedtexts <- texts(myCorpus, groups = "author")
length(groupedtexts)
# [1] 3
names(groupedtexts)
# [1] "author1" "author2" "author3"
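
The code above targets quanteda 1.0.0. As far as I know, from quanteda 3.0 onward texts() is deprecated, corpus_group() does the grouping, kwic() expects a tokens object, and textplot_xray() lives in the separate quanteda.textplots package. A minimal sketch of the same grouping step under those assumptions (the names groupedCorpus and groupedToks are only illustrative):

```r
library("quanteda")  # assumption: quanteda >= 3.0

# same simulated corpus as above
myCorpus <- corpus_reshape(corpus(data_char_sampletext), to = "sentences")
set.seed(1)
docvars(myCorpus, "author") <-
    sample(c("author1", "author2", "author3"),
           size = ndoc(myCorpus), replace = TRUE)

# concatenate the texts of all documents that share an author;
# the new document names are the unique author values
groupedCorpus <- corpus_group(myCorpus, groups = author)
docnames(groupedCorpus)

# tokenize before calling kwic() in these versions
groupedToks <- tokens(groupedCorpus)
head(kwic(groupedToks, "and"))
```

With quanteda.textplots loaded, kwic results built from groupedToks should work in textplot_xray() in place of the ones built from groupedtexts.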

Then (as an illustration):

textplot_xray(
    kwic(groupedtexts, "and"),
    kwic(groupedtexts, "for")
)
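
If the documents should stay separate and only be labelled by author, as the question asks, one possible alternative (not part of the grouping approach above) is to prefix each existing document name with its author. The names stay unique because the original names already are. This is a sketch on the simulated corpus; the ": " separator is an arbitrary choice.

```r
library("quanteda")

# same simulated corpus as above
myCorpus <- corpus_reshape(corpus(data_char_sampletext), to = "sentences")
set.seed(1)
docvars(myCorpus, "author") <-
    sample(c("author1", "author2", "author3"),
           size = ndoc(myCorpus), replace = TRUE)

# build labels such as "<author>: <original docname>" and use them as docnames
docnames(myCorpus) <- paste(docvars(myCorpus, "author"),
                            docnames(myCorpus), sep = ": ")
head(docnames(myCorpus), 3)

# kwic() on tokens carries the new document names through
toks <- tokens(myCorpus)
head(kwic(toks, "and"))
```

Because textplot_xray() takes its row labels from the document names in the kwic results, passing kwic(toks, "and") and kwic(toks, "for") to it should give one row per original document, each labelled with its author.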


Upvotes: 1
