R: quanteda removing tags from corpus

Question

I am working with a number of texts using the quanteda package. My texts contain tags, some with unique values like URLs. I want to remove not only the tags but everything inside the tags as well.

Example:

I'm not sure how to remove them while working with the quanteda package. It seems to me that the dfm function would be the place to do it. I don't think stopwords will work because of the unique URLs. I can use the following gsub call with a regular expression to successfully target the tags I want to remove:

x <- gsub("<.*?>", "", y)
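For illustration, here is a minimal base R sketch of what the lazy quantifier in that pattern does, compared with a greedy one. The sample string, tag name, and URL are made up for this example and are not taken from the actual texts.

```r
# Hypothetical sample text; the tag name and URL are invented for illustration
y <- 'See <ref url="http://example.com/a">this page</ref> for details.'

# Lazy match: each tag (with everything inside its angle brackets) is removed
# separately, and the text between the tags is kept
lazy <- gsub("<.*?>", "", y)

# Greedy match: everything from the first "<" to the last ">" is removed,
# including the text between the opening and closing tag
greedy <- gsub("<.*>", "", y)

print(lazy)
print(greedy)
```

The lazy form is the one that matches the goal described above: the tags and their attribute values disappear, while the surrounding text survives.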

I've gone through the dfm documentation and have tried a few things with the remove and valuetype arguments, but perhaps I don't understand the documentation very well.

Following the answer in this question, I also tried the dfm_select function, but that did not work either.

Here is my code:

library(readtext)
library(quanteda)

#list all .txt files under the working directory
data_dir <- list.files(pattern="*.txt", recursive = TRUE, full.names = TRUE)

#create corpus    
micusp_corpus <- corpus(readtext(data_dir))

#add field 'region'
docvars(micusp_corpus, "Region") <- gsub("(\\w{6})\\..*?$", "", rownames(micusp_corpus$documents))

#create document feature matrix
micusp_dfm <- dfm(micusp_corpus, groups = "Region", remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE)
#try to remove tags
micusp_dfm <- dfm_select(micusp_dfm, "<.*?>", selection = "remove", valuetype = "regex")

#show top tokens (note the appearance of the tag content "oa")
textstat_frequency(micusp_dfm, n=10)

Ken Benoit · Accepted Answer

While your question does not provide a reproducible example, I think I can help. You want to clean the texts that go into your corpus, before you reach the dfm construction stage. Replace the #create corpus line with this:

# read texts, remove tags, and create the corpus
tmp <- readtext(data_dir)
tmp$text <- gsub("<.*?>", "", tmp$text)
micusp_corpus <- corpus(tmp)
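
Since no sample data was provided, here is a small self-contained sketch of the same cleaning step that runs in base R only. It assumes that readtext() returns a data frame with doc_id and text columns, and uses a hand-built data frame as a stand-in for it. The document names, tag, and URL are hypothetical.

```r
# Stand-in for the readtext() result (assumption: a data frame with
# doc_id and text columns); the contents are invented for illustration
tmp <- data.frame(
  doc_id = c("doc1.txt", "doc2.txt"),
  text = c('Intro <a href="http://example.com/1">link</a> text.',
           "Plain text without tags."),
  stringsAsFactors = FALSE
)

# Remove every tag, including anything inside the angle brackets
tmp$text <- gsub("<.*?>", "", tmp$text)

# Check that no tags are left before building the corpus from tmp
any_left <- any(grepl("<.*?>", tmp$text))

print(tmp$text)
print(any_left)
```

Because the tags are gone before tokenization, no tag fragments should reach the dfm, and no dfm_select step is needed for them.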
