Reputation: 169
Hi, I'm working with the last example in this tutorial: Topics proportions over time.
I ran it on my data with this code:
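# Load the packages used below
library(readxl)      # read_xlsx()
library(stringr)     # str_replace_all()
library(tm)          # Corpus(), DocumentTermMatrix(), stopwords()
library(topicmodels) # LDA(), posterior(), terms()
library(reshape2)    # melt()
library(ggplot2)     # ggplot()
library(pals)        # alphabet()
# Import text data
tweets <- read_xlsx("C:/R/data.xlsx")
textdata <- tweets$text
# Remove URLs
textdata <- str_replace_all(textdata, "https://t.co/[a-z,A-Z,0-9]*", "")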
textdata <- gsub("@\\w+", " ", textdata) # Remove user names (all proper names if you're wise!)
textdata <- iconv(textdata, to = "ASCII", sub = " ") # Convert to basic ASCII text to avoid silly characters
textdata <- gsub("#\\w+", " ", textdata)
textdata <- gsub("http.+ |http.+$", " ", textdata) # Remove links
textdata <- gsub("[[:punct:]]", " ", textdata) # Remove punctuation
#Change all the text to lower case
textdata <- tolower(textdata)
#Remove Stopwords. "SMART" is in reference to english stopwords from the SMART information retrieval system and stopwords from other European Languages.
textdata <- tm::removeWords(x = textdata, c(stopwords(kind = "SMART")))
textdata <- gsub(" +", " ", textdata) # General spaces (should just do all whitespaces no?)
# Convert to tm corpus and use its API for some additional fun
corpus <- Corpus(VectorSource(textdata)) # Create corpus object
#Make a Document Term Matrix
dtm <- DocumentTermMatrix(corpus)
# Keep only the documents (rows) that still contain at least one term
ui <- unique(dtm$i)
dtm.new <- dtm[ui, ]
#Fixes this error: "Each row of the input matrix needs to contain at least one non-zero entry" See:
#rowTotals <- apply(datatm , 1, sum) #Find the sum of words in each Document
#dtm.new <- datatm[rowTotals> 0, ]
k <- 7
ldaTopics <- LDA(dtm.new, method = "Gibbs", control = list(alpha = 0.1, seed = 77), k = k)
#topics by year
tmResult <- posterior(ldaTopics)
theta <- tmResult$topics
terms(ldaTopics, 7)
tweets$decade <- paste0(substr(tweets$date2, 0, 3), "0")
topic_proportion_per_decade <- aggregate(theta, by = list(decade = tweets$decade), mean)
top5termsPerTopic <- terms(ldaTopics, 7)
topicNames <- apply(top5termsPerTopic, 2, paste, collapse=" ")
# set topic names to aggregated columns
colnames(topic_proportion_per_decade)[2:(k+1)] <- topicNames
# reshape data frame
vizDataFrame <- melt(topic_proportion_per_decade, id.vars = "decade")
# plot topic proportions per decade as bar plot
ggplot(vizDataFrame, aes(x=decade, y=value, fill=variable)) +
geom_bar(stat = "identity") + ylab("proportion") +
scale_fill_manual(values = paste0(alphabet(20), "FF"), name = "decade") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
Here is the Excel file with the input data.
I get the error when I run the line with the aggregate function, and I can't find out what is going wrong. I created the "decade" variable the same way as in the tutorial; I printed it and it looks OK, and the theta variable is also OK. I changed the aggregate call several times, following for example this post: Error in : arguments must have same length
But I still get the same error. Please help.
Upvotes: 0
Views: 3720
Reputation: 121
I am not sure what you want to achieve with the command
topic_proportion_per_decade <- aggregate(theta, by = list(decade = tweets$decade), mean)
As far as I can see, you produce only one decade with
tweets$decade <- paste0(substr(tweets$date2, 0, 3), "0")
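To illustrate why this can collapse to a single decade, here is a minimal sketch with made-up date strings. The real format of tweets$date2 is not shown in the question, so the ISO-style strings below are an assumption.

```r
# Hypothetical date strings; the actual format of tweets$date2 is an assumption
date2 <- c("2012-03-15", "2016-11-02", "2019-07-30")

# substr() treats a start of 0 like 1, so this keeps the first three characters
decade <- paste0(substr(date2, 0, 3), "0")

print(decade)          # one label per date
print(unique(decade))  # all dates between 2010 and 2019 share one label
```

If all tweets fall within the same ten years, the grouping variable has only one level, so aggregate() would return a single row even once the length problem is solved.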
With all the preprocessing from tweets to textdata, you're producing a few empty lines. This is where your problem starts.
textdata, with its new empty lines, is the basis of your corpus and your dtm. You get rid of the empty documents with the lines:
ui = unique(dtm$i)
dtm.new = dtm[ui,]
In doing so you're deleting the empty rows (documents) in the dtm, thereby changing the length of your object. This reduced dtm without the empty rows is then the basis for the topic model. This comes back to haunt you when you call aggregate() with two objects of different lengths: tweets$decade, which still has the old length of 3418, and theta, which is produced by the topic model, which in turn is based on the reduced dtm (remember, the one with fewer rows).
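The mismatch can be reproduced on a toy example. This is only a sketch: theta and decade_all below are made-up stand-ins (3 surviving documents versus 5 original tweets), not the real objects.

```r
# Toy stand-ins: 3 documents survive preprocessing, but 5 tweets were read in
theta <- matrix(c(0.7, 0.2, 0.5,
                  0.3, 0.8, 0.5),
                ncol = 2,
                dimnames = list(c("1", "3", "4"), c("topic1", "topic2")))
decade_all <- c("2000", "2000", "2010", "2010", "2010")  # one label per original tweet

# aggregate() stops when the grouping vector and theta differ in length
msg <- tryCatch(
  {
    aggregate(theta, by = list(decade = decade_all), mean)
    "no error"
  },
  error = function(e) conditionMessage(e)
)
print(msg)
```

Comparing nrow(theta) with length(tweets$decade) on the real data is a quick way to confirm that this is the cause.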
What I would suggest is to first add an ID column to tweets. Later you can use the IDs to find out which texts were deleted by your preprocessing and match the lengths of tweets$decade and theta.
I rewrote your code. Try this out:
# Import text data
tweets <- read_xlsx("data.xlsx")
## Include ID for later
tweets$ID <- 1:nrow(tweets)
textdata <- tweets$text
# Remove URLs (needs library(stringr) for str_replace_all)
textdata <- str_replace_all(textdata, "https://t.co/[a-z,A-Z,0-9]*", "")
textdata <- gsub("@\\w+", " ", textdata) # Remove user names (all proper names if you're wise!)
textdata <- iconv(textdata, to = "ASCII", sub = " ") # Convert to basic ASCII text to avoid silly characters
textdata <- gsub("#\\w+", " ", textdata)
textdata <- gsub("http.+ |http.+$", " ", textdata) # Remove links
textdata <- gsub("[[:punct:]]", " ", textdata) # Remove punctuation
#Change all the text to lower case
textdata <- tolower(textdata)
#Remove Stopwords. "SMART" is in reference to english stopwords from the SMART information retrieval system and stopwords from other European Languages.
textdata <- tm::removeWords(x = textdata, c(stopwords(kind = "SMART")))
textdata <- gsub(" +", " ", textdata) # General spaces (should just do all whitespaces no?)
# Convert to tm corpus and use its API for some additional fun
corpus <- Corpus(VectorSource(textdata)) # Create corpus object
#Make a Document Term Matrix
dtm <- DocumentTermMatrix(corpus)
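# Keep only the documents (rows) that still contain at least one term
ui <- unique(dtm$i)
dtm.new <- dtm[ui, ]
#Fixes this error: "Each row of the input matrix needs to contain at least one non-zero entry" See:
#rowTotals <- apply(datatm , 1, sum) #Find the sum of words in each Document
#dtm.new <- datatm[rowTotals> 0, ]
k <- 7
ldaTopics <- LDA(dtm.new, method = "Gibbs", control = list(alpha = 0.1, seed = 77), k = k)
#topics by year
tmResult <- posterior(ldaTopics)
theta <- tmResult$topics
terms(ldaTopics, 7)
# IDs of the documents that survived, assuming the document names are the row numbers of tweets.
# as.integer() makes merge() sort them numerically, i.e. in the same order as the rows of theta.
id <- data.frame(ID = as.integer(dtm.new$dimnames$Docs))
colnames(id) <- "ID"
tweets$decade <- paste0(substr(tweets$date2, 0, 3), "0")
tweets_new <- merge(id, tweets, by = "ID", all.x = TRUE)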
topic_proportion_per_decade <- aggregate(theta, by = list(decade = tweets_new$decade), mean)
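An alternative to merge() is to look up the surviving documents directly with match(), which keeps the rows in the order of theta. The sketch below uses made-up toy data and assumes that the row names of theta are the document IDs; check rownames(theta) on your own model before relying on this.

```r
# Toy stand-ins for the real objects; all values are made up
tweets <- data.frame(ID = 1:5,
                     decade = c("2000", "2000", "2010", "2010", "2010"))
theta <- matrix(c(0.7, 0.2, 0.5,
                  0.3, 0.8, 0.5),
                ncol = 2,
                dimnames = list(c("1", "3", "4"), c("topic1", "topic2")))

# Find each surviving document's row in tweets by its ID (assumed to be the row name)
keep <- match(as.integer(rownames(theta)), tweets$ID)
decade_kept <- tweets$decade[keep]

# The grouping vector now has one entry per row of theta
stopifnot(length(decade_kept) == nrow(theta))
topic_proportion_per_decade <- aggregate(theta, by = list(decade = decade_kept), mean)
print(topic_proportion_per_decade)
```

Because match() returns positions in the order of rownames(theta), no sorting step can shift decades onto the wrong documents.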
Upvotes: 3