Reputation: 39
Although I supply both the default stopword list and an extra stopword list when I use MALLET for topic modeling, some stopwords still appear in the topics, for example "ın", "ıf", "ıt". How do I ensure that these stopwords don't appear in the topic models? The topic models are below.
0 5 ı ıt time room door house people eyes thing night woman day make girl face mother voice car home
1 5 ıt ın fact sense point experience order form human action common general religious law part change number case evidence
2 5 time place work water long make cut ın square large top house side built machine building clay piece design
3 5 school people ın development national american members social program system economic group problems education class students work policy children
4 5 year york week home music american city house president day school club william show white ın days family night
5 5 ıt time fire feet river long road side miles game land run hit war gun big ball began arms
6 5 hands water white hand ın black food eyes face slowly sun cold ıt life red head hot long body
7 5 ın number system data surface temperature high low type volume information material pressure feed form small results shown method
8 5 world life church god war time great death book english ın century history england french west soviet love spirit
9 5 state year united government general business federal department court tax cost million company secretary act public ın service industry
Thanks for any advice.
Upvotes: 3
Views: 942
Reputation: 665
Check the spelling of your stopwords. Mallet lowercases your corpus by default, but it does not lowercase your stopwords!
Also check the format of your stopword file: Mallet expects it to be one-word-per-line.
And don't forget to pass the option --stoplist-file yourstopwordfile.txt to the command mallet import-dir.
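If you are unsure whether your stopword file meets these two requirements (lowercase, one word per line), a small script can normalize it before you run the import. This is only a sketch: the function names and file paths are made up for illustration, and it assumes the file is UTF-8 and that words are separated by whitespace.

```python
def normalize_stopwords(raw_text):
    """Return lowercased stopwords in first-seen order, without duplicates."""
    seen = set()
    words = []
    # split() breaks on any whitespace, so space- or tab-separated
    # lists are turned into individual words as well.
    for word in raw_text.split():
        word = word.lower()
        if word not in seen:
            seen.add(word)
            words.append(word)
    return words


def rewrite_stopword_file(src_path, dst_path):
    """Read a stopword file and write it back as one lowercase word per line."""
    # Both paths are hypothetical; the file is assumed to be UTF-8.
    with open(src_path, encoding="utf-8") as src:
        words = normalize_stopwords(src.read())
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write("\n".join(words) + "\n")
```

Calling rewrite_stopword_file("yourstopwordfile.txt", "stopwords.clean.txt") would then give you a file you can pass to the import step.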
EDIT: Beware of OCR errors in your input file: I see that in the topics words like "ın" are spelled with a dotless i (as used in Turkish orthography), not with the usual dotted i. So either apply some OCR correction before topic modelling, or add the misspelled dotless-i forms such as "ın" to your stopwords.
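For the second option, the dotless-i forms can be generated from the stopwords you already have rather than typed by hand. A minimal sketch, assuming the stopwords are already lowercase; the function name is hypothetical, and it replaces every "i" in a word, which may add a few variants that never occur in your corpus (these are harmless as stopwords).

```python
DOTLESS_I = "\u0131"  # LATIN SMALL LETTER DOTLESS I


def add_dotless_variants(stopwords):
    """Return the stopwords plus a dotless-i variant of each word containing 'i'."""
    result = list(stopwords)
    known = set(stopwords)
    for word in stopwords:
        variant = word.replace("i", DOTLESS_I)
        # Words without an 'i' produce an identical variant, which is
        # already in `known` and therefore skipped.
        if variant not in known:
            known.add(variant)
            result.append(variant)
    return result
```

Write the returned list back to the stopword file, one word per line, and import the corpus again.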
EDIT2: There is another possible source for the dotless-i "ın", "ıf", "ıt": Mallet lowercases all words in the corpus. When your locale is set to Turkish, Java lowercases a capital I to a dotless i. Check your Java language settings and create the topic model again from scratch.
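To tell the two explanations apart, you can check whether the dotless i is already present in your input text. The following is a rough heuristic, not a definitive test; the function name and the returned messages are made up for illustration, and you would pass in the contents of your corpus and of the topic keys output as strings.

```python
DOTLESS_I = "\u0131"  # LATIN SMALL LETTER DOTLESS I


def likely_cause(corpus_text, topic_keys_text):
    """Guess where dotless-i characters in the topics come from."""
    in_corpus = corpus_text.count(DOTLESS_I)
    in_topics = topic_keys_text.count(DOTLESS_I)
    if in_topics == 0:
        return "no dotless i in topics"
    if in_corpus == 0:
        # The character is absent from the input, so it was created
        # during processing, e.g. by locale-dependent lowercasing.
        return "introduced during processing (check Java locale)"
    return "already present in input (OCR or encoding issue)"
```

If the input contains no dotless i but the topics do, the locale explanation is the more plausible one.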
Upvotes: 1