
Get around deprecated quanteda texts() function

I am trying to replicate this paper.

In the tokens.R script, the corpus is cleaned up with the following command:

texts(corp) <- stri_replace_all_regex(texts(corp), "^[\\p{Lu}\\p{Z}]+(.{0,30}?)(\\(.{0,50}?\\))?(--)", "")

Which yields the following error message:

Error in attributes(.Data) <- c(attributes(.Data), attrib) : 
  'names' attribute [387896] must be the same length as the vector [4]
In addition: Warning message:
'texts.corpus' is deprecated.
Use 'as.character' instead.
See help("Deprecated") 

So I naively applied the 'as.character' function like this:

as.character(corp) <- stri_replace_all_regex(as.character(corp), "^[\\p{Lu}\\p{Z}]+(.{0,30}?)(\\(.{0,50}?\\))?(--)", "")

Which yields the following error:

Error in attributes(.Data) <- c(attributes(.Data), attrib) : 
  'names' attribute [387896] must be the same length as the vector [4]

I tried some other things, like addressing only $documents within the corpus or turning the corpus into a vector, but none of that worked.

How can I get around this?

Thank you in advance.

Answers (1)

Ken Benoit

The "corpus" loaded in the linked tokens.R script (from data/corpus_nytimes_summary.RDS) is stored in a very old corpus object format.

You can convert this into a new format corpus using:

corp <- corpus(corp)

Then replace the texts using this approach:

corp[] <- stri_replace_all_regex(corp, "^[\\p{Lu}\\p{Z}]+(.{0,30}?)(\\(.{0,50}?\\))?(--)", "")

Assigning to corp[] replaces only the character values of corp, without stripping the additional attributes (metadata and docvars) that make the character object corp a quanteda corpus.
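The attribute-preserving behaviour of empty-bracket assignment can be seen in base R alone. The sketch below does not use quanteda: x is a hypothetical stand-in (a named character vector with a made-up "meta" attribute), and the regular expression is a simplified pattern chosen only for illustration.

```r
# Hypothetical stand-in for a corpus: a named character vector carrying an
# extra attribute. This is NOT a real quanteda corpus object.
x <- structure(
  c(doc1 = "HEADLINE -- first text", doc2 = "second text"),
  meta = list(source = "example")
)

# as.character() drops names and other attributes, so `cleaned` is a plain
# character vector holding only the modified texts.
cleaned <- sub("^[A-Z ]+-- ", "", as.character(x))

# Empty-bracket assignment replaces the elements of y in place and keeps
# its names and its "meta" attribute.
y <- x
y[] <- cleaned

print(attributes(y))        # names and meta are still attached
print(attributes(cleaned))  # NULL: the plain vector carries no attributes
```

Writing y <- cleaned instead would replace the whole object with the plain vector, which is why the answer assigns into corp[] rather than overwriting corp.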
