Reputation: 41
I have a project I'm working on in tidytext, which I'm pretty new to. My input data is currently in the form of individual .txt files in a folder. I successfully used get_sentiments() to track the positive/negative sentiments of my data, but I'm looking to do some more advanced topic modelling.
I'm trying to work off of this guide, but I'm struggling to get started. It looks like the input data you need to do topic modelling is a DocumentTermMatrix, which I'm unsure how to create. Is there a way to turn the data I currently have as individual files into this format so that I can use the methods described in that guide?
Upvotes: 1
Views: 119
Reputation: 11613
If you're interested in faster performance and/or using tidy data principles, then you can avoid using the tm package altogether. Check out this chapter of the book on how to convert back and forth from tidy data structures to a document-term matrix.
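As a rough illustration of that round trip, here is a minimal sketch. The toy documents and column names are made up for the example, and cast_dtm() needs the tm package installed:

```r
library(dplyr)
library(tidytext)

# Toy corpus; the document names and text are invented for illustration
docs <- tibble(
  doc  = c("a.txt", "b.txt"),
  text = c("the cat sat on the mat", "the dog chased the cat")
)

# Tidy format: one row per document-word pair, with a count
word_counts <- docs %>%
  unnest_tokens(word, text) %>%
  count(doc, word)

# Tidy data frame -> DocumentTermMatrix
dtm <- word_counts %>%
  cast_dtm(doc, word, n)

# DocumentTermMatrix -> tidy data frame again
tidy_again <- tidy(dtm)
tidy_again
```

The resulting dtm object is the kind of input that functions expecting a DocumentTermMatrix can take, while tidy() brings it back to one row per document-term pair.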
Here is a guide on how to get started with topic modeling. After your data is in memory (I recommend using readr::read_lines() with text files), you would do something like this:
library(tidyverse)
library(tidytext)
library(stm)
#> stm v1.3.5 successfully loaded. See ?stm for help.
#> Papers, resources, and other materials at
library(janeaustenr)
austen_sparse <- austen_books() %>% # austen_books() stands in for the output of read_lines()
  unnest_tokens(word, text) %>%
  anti_join(stop_words) %>%
  count(book, word) %>%
  cast_sparse(book, word, n) # cast_sparse() converts the counts to a sparse document-term matrix
#> Joining, by = "word"
topic_model <- stm(austen_sparse, K = 12, verbose = FALSE, init.type = "Spectral")
summary(topic_model)
#> A topic model with 12 topics, 6 documents and a 13914 word dictionary.
#> Topic 1 Top Words:
#> Highest Prob: anne, captain, elliot, lady, wentworth, charles, time
#> FREX: elliot, wentworth, walter, anne, russell, musgrove, louisa
#> Lift: acknowledgement, lyme, benwick, henrietta, musgrove, walter, kellynch
#> Score: elliot, wentworth, walter, russell, musgrove, anne, louisa
#> Topic 2 Top Words:
#> Highest Prob: emma, miss, harriet, weston, knightley, elton, jane
#> FREX: weston, knightley, elton, woodhouse, fairfax, churchill, hartfield
#> Lift: _broke_, elton's, bates, elton, emma's, enscombe, fairfax
#> Score: emma, weston, knightley, elton, woodhouse, fairfax, harriet
#> Topic 3 Top Words:
#> Highest Prob: elinor, marianne, time, dashwood, sister, edward, mother
#> FREX: elinor, marianne, dashwood, jennings, willoughby, brandon, ferrars
#> Lift: 1811, dashwoods, jennings's, palmer, barton, berkeley, brandon
#> Score: elinor, marianne, dashwood, jennings, willoughby, lucy, brandon
#> Topic 4 Top Words:
#> Highest Prob: fanny, crawford, miss, sir, edmund, time, thomas
#> FREX: crawford, edmund, bertram, norris, rushworth, mansfield, julia
#> Lift: _allow_, bertram, crawford, crawford's, norris, rushworth, susan
#> Score: fanny, crawford, edmund, thomas, bertram, norris, rushworth
#> Topic 5 Top Words:
#> Highest Prob: catherine, miss, tilney, time, isabella, thorpe, morland
#> FREX: tilney, catherine, thorpe, morland, isabella, allen, henry
#> Lift: abbeys, average, camilla, causeless, closets, convent, cravats
#> Score: catherine, tilney, thorpe, morland, allen, isabella, eleanor
#> Topic 6 Top Words:
#> Highest Prob: elizabeth, darcy, bennet, miss, jane, bingley, time
#> FREX: darcy, bennet, bingley, wickham, collins, lydia, lizzy
#> Lift: _accident_, lucas, bennet, bingley, bourgh, collins, darcy's
#> Score: darcy, elizabeth, bennet, bingley, wickham, collins, lydia
#> Topic 7 Top Words:
#> Highest Prob: catherine, miss, tilney, time, isabella, thorpe, morland
#> FREX: tilney, catherine, thorpe, morland, isabella, allen, henry
#> Lift: affrighted, andrews, average, blaize, camilla, causeless, closets
#> Score: catherine, tilney, thorpe, morland, allen, isabella, eleanor
#> Topic 8 Top Words:
#> Highest Prob: anne, captain, elliot, lady, wentworth, charles, time
#> FREX: elliot, wentworth, walter, anne, russell, musgrove, louisa
#> Lift: alicia, lyme, musgrove, walter, benwick, henrietta, kellynch
#> Score: elliot, wentworth, walter, russell, musgrove, anne, louisa
#> Topic 9 Top Words:
#> Highest Prob: catherine, miss, tilney, time, isabella, thorpe, morland
#> FREX: tilney, catherine, thorpe, morland, isabella, allen, henry
#> Lift: alps, andrews, blaize, france, gloucestershire, heroic, heroine
#> Score: catherine, tilney, thorpe, morland, allen, isabella, eleanor
#> Topic 10 Top Words:
#> Highest Prob: catherine, miss, tilney, time, isabella, thorpe, morland
#> FREX: tilney, catherine, thorpe, morland, isabella, allen, henry
#> Lift: antiquity, france, gloucestershire, heroic, lid, eleanor, eleanor's
#> Score: catherine, tilney, thorpe, morland, allen, isabella, eleanor
#> Topic 11 Top Words:
#> Highest Prob: anne, captain, elliot, lady, wentworth, charles, time
#> FREX: elliot, wentworth, walter, anne, russell, musgrove, louisa
#> Lift: archibald, lyme, walter, benwick, henrietta, kellynch, musgrove
#> Score: elliot, wentworth, walter, russell, musgrove, anne, louisa
#> Topic 12 Top Words:
#> Highest Prob: catherine, miss, tilney, time, isabella, thorpe, morland
#> FREX: tilney, catherine, thorpe, morland, isabella, allen, anyone's
#> Lift: anyone's, eleanor, eleanor's, heroine, northanger, thorpe's, thorpes
#> Score: catherine, tilney, thorpe, morland, allen, anyone's, isabella
tidy(topic_model)
#> # A tibble: 166,968 x 3
#>    topic term      beta
#>    <int> <chr>    <dbl>
#>  1     1 1     1.18e- 4
#>  2     2 1     1.15e-19
#>  3     3 1     5.51e- 5
#>  4     4 1     1.33e-19
#>  5     5 1     4.20e- 5
#>  6     6 1     2.68e- 5
#>  7     7 1     4.20e- 5
#>  8     8 1     1.18e- 4
#>  9     9 1     4.20e- 5
#> 10    10 1     4.20e- 5
#> # … with 166,958 more rows
Created on 2020-03-25 by the reprex package (v0.3.0)
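To adapt this to a folder of .txt files, something along these lines might work. It is only a sketch: the folder, file names, and toy text are hypothetical (a temporary directory is used so the example runs anywhere), and it keeps one row per line of each file with the file name as the document id.

```r
library(dplyr)
library(purrr)
library(readr)
library(tidytext)

# Hypothetical folder; two toy files are written so the sketch is runnable
txt_dir <- file.path(tempdir(), "txt_demo")
dir.create(txt_dir, showWarnings = FALSE)
write_lines(c("pirates sailed the ocean", "pirates buried treasure"),
            file.path(txt_dir, "one.txt"))
write_lines("wizards studied ancient spells", file.path(txt_dir, "two.txt"))

files <- list.files(txt_dir, pattern = "\\.txt$", full.names = TRUE)

# One row per line of text, tagged with the file it came from
raw_text <- files %>%
  map(~ tibble(document = basename(.x), text = read_lines(.x))) %>%
  bind_rows()

# Same pipeline as above, with `document` in place of `book`
docs_sparse <- raw_text %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") %>%
  count(document, word) %>%
  cast_sparse(document, word, n)

dim(docs_sparse)
```

In your own project you would point txt_dir at your folder and drop the write_lines() calls. The resulting sparse matrix can be passed to stm() in the same way as austen_sparse.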
Upvotes: 1
Reputation: 996
You can read all your .txt files into a data frame and create a DocumentTermMatrix from it using the tm package:
library(readtext)
library(tm)
# make example text files
text1 <- "hello world 77"
text2 <- "What time is it? 23"
dir.create("./data", showWarnings = FALSE)
writeLines(text1, "./data/text1.txt")
writeLines(text2, "./data/text2.txt")
# read txt files
files <- list.files("./data", full.names = TRUE) # replace this path with the folder that holds your text files
data <- readtext(files)
# transform the data to a corpus
corpus <- Corpus(VectorSource(data$text))
# add normalizations (you can skip this or add more)
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeNumbers)
# make document-term matrix
review_dtm <- DocumentTermMatrix(corpus)
review_dtm
<<DocumentTermMatrix (documents: 2, terms: 5)>>
Non-/sparse entries: 5/5
Sparsity : 50%
Maximal term length: 5
Weighting : term frequency (tf)
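Once you have a DocumentTermMatrix, a topic model can be fit to it. The sketch below assumes the guide you are following uses topicmodels::LDA(), which may not be the case. The toy texts, k = 2, and the seed are arbitrary choices for illustration.

```r
library(tm)
library(topicmodels)
library(tidytext)

# Toy documents, invented for illustration
texts <- c(
  "cats purr and cats nap",
  "dogs bark and dogs fetch",
  "cats nap near dogs",
  "dogs fetch sticks"
)
dtm <- DocumentTermMatrix(VCorpus(VectorSource(texts)))

# Fit a two-topic LDA model; k and the seed are arbitrary here
lda <- LDA(dtm, k = 2, control = list(seed = 1234))

# Per-topic word probabilities, one row per topic-term pair
topics <- tidy(lda, matrix = "beta")
topics
```

With real data you would pass review_dtm instead of the toy dtm and pick k based on your corpus.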
Upvotes: 1