bubunny

Reputation: 39

Mallet - Topic Modeling - Stopwords Error

Although I supply both the default stopword list and an extra stopword list when I use MALLET for topic modeling, some stopwords still appear in the topics, for example "ın", "ıf", "ıt". How do I ensure that these stopwords don't appear in the topic models? The topics are below.

0 5 ı ıt time room door house people eyes thing night woman day make girl face mother voice car home

1 5 ıt ın fact sense point experience order form human action common general religious law part change number case evidence

2 5 time place work water long make cut ın square large top house side built machine building clay piece design

3 5 school people ın development national american members social program system economic group problems education class students work policy children

4 5 year york week home music american city house president day school club william show white ın days family night

5 5 ıt time fire feet river long road side miles game land run hit war gun big ball began arms

6 5 hands water white hand ın black food eyes face slowly sun cold ıt life red head hot long body

7 5 ın number system data surface temperature high low type volume information material pressure feed form small results shown method

8 5 world life church god war time great death book english ın century history england french west soviet love spirit

9 5 state year united government general business federal department court tax cost million company secretary act public ın service industry

Thanks for any advice.

Upvotes: 3

Views: 942

Answers (1)

Sir Cornflakes

Reputation: 665

Check the spelling of your stopwords. Mallet lowercases your corpus by default, but it does not lowercase your stopwords!

Also check the format of your stopword file: Mallet expects it to be one-word-per-line.
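If you want to rule out both problems at once, here is a minimal Python sketch that rewrites a stopword file as lowercase, one word per line, with duplicates dropped. The function names and the file paths are my own placeholders, not anything Mallet provides, and it assumes your words are separated by whitespace only.

```python
from pathlib import Path


def normalize_stopwords(raw_text):
    """Return lowercase stopwords in first-seen order, without duplicates."""
    seen = set()
    words = []
    # split() breaks on any whitespace, so several words on one line
    # end up as separate entries.
    for token in raw_text.split():
        word = token.lower()
        if word not in seen:
            seen.add(word)
            words.append(word)
    return words


def rewrite_stopword_file(src, dst):
    """Read src (UTF-8) and write a cleaned one-word-per-line copy to dst."""
    words = normalize_stopwords(Path(src).read_text(encoding="utf-8"))
    Path(dst).write_text("\n".join(words) + "\n", encoding="utf-8")
```

You would call it as `rewrite_stopword_file("mystopwords.txt", "mystopwords.clean.txt")` (both names hypothetical) and then point Mallet at the cleaned file.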

And don't forget to pass the option --stoplist-file yourstopwordfile.txt to the command mallet import-dir.

EDIT: Beware of OCR errors in your input file: I see that in the topics words like "ın" are spelled with a dotless i (as used in Turkish orthography), not with the usual dotted i. So either apply some OCR correction before topic modelling or make the misspelled ın's with dotless i additional stopwords.
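Both workarounds can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the helper names are made up, the corpus is assumed to be English (so a dotless i is never legitimate), and only two misspelling patterns are generated, a dotless first letter and every i dotless.

```python
DOTLESS_I = "\u0131"  # the Turkish dotless i


def dotless_variants(stopwords):
    """Return dotless-i misspellings of the stopwords that contain an 'i'."""
    variants = []
    for word in stopwords:
        candidates = set()
        if word.startswith("i"):
            # Only the first letter is affected, e.g. "in" -> dotless "in"
            candidates.add(DOTLESS_I + word[1:])
        if "i" in word:
            # Every i is affected, e.g. a word that was all capitals
            candidates.add(word.replace("i", DOTLESS_I))
        variants.extend(sorted(candidates))
    return variants


def repair_dotless_i(text):
    """Replace every dotless i with a regular i (English text only)."""
    return text.replace(DOTLESS_I, "i")
```

Append the output of `dotless_variants` to your extra stopword file, or run `repair_dotless_i` over the input documents before importing them. The repair step would damage genuine Turkish text, so use it only on a corpus where the dotless i is always an error.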

EDIT2: There is another possible source for the dotless-i "ın", "ıf", "ıt": Mallet lowercases all words in the corpus. When your locale is set to Turkish, Java lowercases a capital I to a dotless i. Check your Java language settings and create the topic model again from scratch.

Upvotes: 1
