Need help diagnosing cause of "Covariate matrix is singular" when estimating effect in structural topic model (stm)

Question

First things first. I've saved my workspace and you can load it with the following command: load(url("https://dl.dropboxusercontent.com/s/06oz5j41nif7la5/example.RData?dl=0"))

I have a number of abstract texts and I'm attempting to estimate a structural topic model to measure topic prevalence over time. The data contains a document id, abstract text, and year of publication.

I want to generate trends in expected topic proportion over time, like the ones the authors of the STM vignette show.

I'm able to create my topic model without issue, but when I run the estimateEffect() function from the stm package in R, I always get a "Covariate matrix is singular" warning.

And my trends look like this:

In the documentation, the authors note that

The function will automatically check whether the covariate matrix is singular which generally results from linearly dependent columns. Some common causes include a factor variable with an unobserved level, a spline with degrees of freedom that are too high, or a spline with a continuous variable where a gap in the support of the variable results in several empty basis functions.

I've tried a variety of different models, from a 2-topic solution all the way up to a 52-topic solution, always with the same result. If I remove the spline function from the "year" variable in my model and assume a linear fit, then estimateEffect() works just fine. So the problem must lie in the spline on year; I just don't know what exactly it is.
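For anyone who wants to check the causes listed in the documentation directly, here is a minimal sketch using only the splines package on toy data. It assumes that s() in stm is a thin wrapper around splines::bs(), as its help page describes. The helper check_design() is hypothetical (it is not part of stm), and the knots are placed by hand inside the gap purely to make the effect visible; the knots stm actually picks for my data may differ.

```r
library(splines)

# Hypothetical helper (not part of stm): compare the rank of a design
# matrix with its number of columns and list columns that are all zero.
check_design <- function(X) {
  list(
    n_columns     = ncol(X),
    rank          = qr(X)$rank,
    empty_columns = which(colSums(abs(X)) == 0)
  )
}

# Toy covariate with a gap in its support: values only near 1-3 and 18-20.
x <- c(seq(1, 3, length.out = 21), seq(18, 20, length.out = 21))

# Cubic B-spline basis with interior knots placed by hand inside the gap
# (an assumption for illustration), plus an intercept column.
X <- cbind(1, bs(x, knots = 5:15, Boundary.knots = range(x)))

res <- check_design(X)
print(res)

# The same check on the real data would look roughly like this
# (requires stm to be loaded and the workspace from the link above):
# X_year <- model.matrix(~ s(year), data = out$meta)
# check_design(X_year)
# table(out$meta$year)  # how many documents fall in each year
```

If the reported rank is smaller than the number of columns, the design matrix is singular, and any all-zero columns correspond to basis functions that no observation falls under. On the real data, table(out$meta$year) should also show whether there are gaps or only a few distinct years.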

Again, here's a link to my workspace: load(url("https://dl.dropboxusercontent.com/s/06oz5j41nif7la5/example.RData?dl=0"))

And here is the code I'm using to get there:

library(udpipe)
library(dplyr) # data wrangling
library(readr) # import data
library(ggplot2) # viz
library(stm) # STM
library(tidytext) # Tf-idf
library(tm) # DTM stuff
library(quanteda) # For using ngrams in STM

rm(list = ls())

abstracts <- read_delim("Data/5528_demand_ta.txt", 
                        delim = "\t", escape_double = FALSE, 
                        col_names = TRUE, trim_ws = TRUE)


abstracts <- rename(abstracts, doc_id = cpid)
abstracts$doc_id <- as.character(abstracts$doc_id)

# Download the English udpipe language model
ud_model <- udpipe_download_model(language = "english")
ud_model <- udpipe_load_model(ud_model$file_model)

# Interpret abstracts assuming English
x <- udpipe_annotate(ud_model, x = abstracts$abstract, doc_id = abstracts$doc_id)
x <- as.data.frame(x)

# Regroup terms
data <- paste.data.frame(x, term = "lemma", group = c("doc_id"))
data <- left_join(data, abstracts, by = "doc_id") %>%
  rename(term = lemma) %>%
  select(doc_id, term , year)

# Prepare text
processed <- textProcessor(documents = data$term, 
                           metadata = data,
                           lowercase = TRUE, 
                           removestopwords = TRUE,
                           removenumbers = TRUE,
                           removepunctuation = TRUE,
                           stem = FALSE)
out <- prepDocuments(processed$documents, 
                     processed$vocab, 
                     processed$meta, 
                     lower.thresh = 20, # drop terms that appear in 20 or fewer docs
                     upper.thresh = 1000) # I've been using about 1/3 of documents as an upper thresh

# Build model allowing tSNE to pick k (should result in 52 topics)
stm_mod <- stm(documents = out$documents,
               vocab = out$vocab,
               K = 0,
               init.type = "Spectral",
               prevalence = ~ s(year),
               data = out$meta,
               max.em.its = 500, # Maximum number of EM iterations
               seed = 831)

###################################################################################
########### If you loaded the workspace from my link, then you are here ###########
###################################################################################

# Estimate effect of year
prep <- estimateEffect(formula = 1:52 ~ s(year), 
                       stmobj = stm_mod,
                       metadata = out$meta)

# Plot expected topic proportion
summary(prep, topics=1)
plot.estimateEffect(prep, 
                    "year", 
                    method = "continuous", 
                    model = stm_mod,
                    topics = 5,
                    printlegend = TRUE, 
                    xaxt = "n", 
                    xlab = "Years")
