Obed
Obed

Reputation: 423

Need help diagnosing cause of "Covariate matrix is singular" when estimating effect in structural topic model (stm)

First things first. I've saved my workspace and you can load it with the following command: load(url("https://dl.dropboxusercontent.com/s/06oz5j41nif7la5/example.RData?dl=0"))

I have a number of abstract texts and I'm attempting to estimate a structural topic model to measure topic prevalence over time. The data contains a document id, abstract text, and year of publication.

I want to generate trends in expected topic proportion over time like the authors of the STM Vignette do here: Proportion trend I want to show

I'm able to create my topic model without issue, but when I attempt to run the estimateEffect() function from the stm package in R, I always get the following warning: enter image description here

And my trends look like this: enter image description here

In the documentation, the authors note that

The function will automatically check whether the covariate matrix is singular which generally results from linearly dependent columns. Some common causes include a factor variable with an unobserved level, a spline with degrees of freedom that are too high, or a spline with a continuous variable where a gap in the support of the variable results in several empty basis functions.

I've tried a variety of different models, using a 2-topic solution all the way up to 52-topic solution, always with the same result. If I remove the spline function from the "year" variable in my model and assume a linear fit, then estimateEffect() works just fine. So it must be an issue with the splined data. I just don't know what exactly.

Again, here's a link to my workspace: load(url("https://dl.dropboxusercontent.com/s/06oz5j41nif7la5/example.RData?dl=0"))

And here is the code I'm using to get there:

library(udpipe)
library(dplyr) # data wrangling
library(readr) # import data
library(ggplot2) # viz
library(stm) # STM
library(tidytext) # Tf-idf
library(tm) # DTM stuff
library(quanteda) # For using ngrams in STM

rm(list = ls())

abstracts <- read_delim("Data/5528_demand_ta.txt", 
                        delim = "\t", escape_double = FALSE, 
                        col_names = TRUE, trim_ws = TRUE)


abstracts <- rename(abstracts, doc_id = cpid)
abstracts$doc_id <- as.character(abstracts$doc_id)

# Download english dictionary
ud_model <- udpipe_download_model(language = "english")
ud_model <- udpipe_load_model(ud_model$file_model)

# Interpret abstracts assuming English
x <- udpipe_annotate(ud_model, x = abstracts$abstract, doc_id = abstracts$doc_id)
x <- as.data.frame(x)

# Regroup terms
data <- paste.data.frame(x, term = "lemma", group = c("doc_id"))
data <- left_join(data, abstracts) %>%
  rename(term = lemma) %>%
  select(doc_id, term , year)

# Prepare text
processed <- textProcessor(documents = data$term, 
                           metadata = data,
                           lowercase = TRUE, 
                           removestopwords = TRUE,
                           removenumbers = TRUE,
                           removepunctuation = TRUE,
                           stem = FALSE)
out <- prepDocuments(processed$documents, 
                     processed$vocab, 
                     processed$meta, 
                     lower.thresh = 20, # term must appear in at least n docs to matter
                     upper.thres = 1000) # I've been using about 1/3 of documents as an upper thresh

# Build model allowing tSNE to pick k (should result in 52 topics)
stm_mod <- stm(documents = out$documents,
               vocab = out$vocab,
               K = 0,
               init.type = "Spectral",
               prevalence = ~ s(year),
               data = out$meta,
               max.em.its = 500, # Max number of runs to attempt 
               seed = 831)

###################################################################################
########### If you loaded the workspace from my link, then you are here ###########
###################################################################################

# Estimate effect of year
prep <- estimateEffect(formula = 1:52 ~ s(year), 
                       stmobj = stm_mod,
                       metadata = out$meta)

# Plot expected topic proportion
summary(prep, topics=1)
plot.estimateEffect(prep, 
                    "year", 
                    method = "continuous", 
                    model = stm_mod,
                    topics = 5,
                    printlegend = TRUE, 
                    xaxt = "n", 
                    xlab = "Years")

Upvotes: 0

Views: 503

Answers (1)

Jim
Jim

Reputation: 191

A singular matrix simply means that you have linearly dependent rows or columns. First thing you could do is check the determinant of the matrix - a singular matrix implies a zero determinant - which means the matrix can't be inverted.

Next thing would be to identify the literally dependent rows (columns), you can do so using smisc::findDepMat(X, rows = TRUE, tol = 1e-10) for rows, and smisc::findDepMat(X, rows = FALSE, tol = 1e-10) for columns. You MAY be able to alter the levels of tol in findDepMat() and etol in stm() to arrive at a solution, probably an unstable solution, but a solution.

Upvotes: 1

Related Questions