Reputation: 33
I'm getting a CalledProcessError "non-zero exit status 1" error when I run the Gensim LDAMallet model on my full corpus of ~16 million documents. Interestingly enough, if I run the exact same code on a testing corpus of ~160,000 documents the code runs perfectly fine. Since it's working fine on my small corpus I'm inclined to think that the code is fine, but I'm not sure what else would/could cause this error...
I've tried editing the mallet.bat file as suggested here, but to no avail. I've also double checked the paths, but that shouldn't be an issue given that it works with a smaller corpus.
import os
import gensim
from gensim import corpora

id2word = corpora.Dictionary(lists_of_words)
corpus = [id2word.doc2bow(doc) for doc in lists_of_words]
num_topics = 30
os.environ.update({'MALLET_HOME': r'C:/mallet-2.0.8/'})
mallet_path = r'C:/mallet-2.0.8/bin/mallet'
ldamallet = gensim.models.wrappers.LdaMallet(mallet_path, corpus=corpus, num_topics=num_topics, id2word=id2word)
Here's the full traceback and error:
File "<ipython-input-57-f0e794e174a6>", line 8, in <module>
ldamallet = gensim.models.wrappers.LdaMallet(mallet_path, corpus=corpus, num_topics=num_topics, id2word=id2word)
File "C:\ProgramData\Anaconda3\lib\site-packages\gensim\models\wrappers\ldamallet.py", line 132, in __init__
self.train(corpus)
File "C:\ProgramData\Anaconda3\lib\site-packages\gensim\models\wrappers\ldamallet.py", line 273, in train
self.convert_input(corpus, infer=False)
File "C:\ProgramData\Anaconda3\lib\site-packages\gensim\models\wrappers\ldamallet.py", line 262, in convert_input
check_output(args=cmd, shell=True)
File "C:\ProgramData\Anaconda3\lib\site-packages\gensim\utils.py", line 1918, in check_output
raise error
CalledProcessError: Command 'C:/mallet-2.0.8/bin/mallet import-file --preserve-case --keep-sequence --remove-stopwords --token-regex "\S+" --input C:\Users\user\AppData\Local\Temp\2\e1ba4a_corpus.txt --output C:\Users\user\AppData\Local\Temp\2\e1ba4a_corpus.mallet' returned non-zero exit status 1.
Upvotes: 1
Views: 594
Reputation: 1202
I'm glad you found my post, and I'm sorry it didn't work for you. I hit that error for a combination of reasons, mainly that Java was not installed properly and the path wasn't picking up the environment variables.
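If you want to rule those two causes out quickly, something like the following might help. It is only a rough sketch: `check_mallet_setup` is a helper name I made up, and it checks the basics (a `java` executable on PATH, `MALLET_HOME` being set, the Mallet folder existing), not whether Mallet itself runs.

```python
import os
import shutil


def check_mallet_setup(mallet_home, env=None):
    """Return a list of problems found; an empty list means the basics look fine."""
    env = os.environ if env is None else env
    problems = []
    # Mallet is a Java program, so a `java` executable has to be reachable on PATH.
    if shutil.which("java", path=env.get("PATH", "")) is None:
        problems.append("java not found on PATH")
    # The gensim wrapper in this thread relies on MALLET_HOME being set.
    if env.get("MALLET_HOME") is None:
        problems.append("MALLET_HOME is not set")
    # The folder passed in should actually exist on disk.
    if not os.path.isdir(mallet_home):
        problems.append(f"Mallet directory does not exist: {mallet_home}")
    return problems
```

Calling `check_mallet_setup(r'C:/mallet-2.0.8/')` right after setting `MALLET_HOME` should return an empty list if the setup is sound.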
Since your code runs on a smaller data set, I'd look first at your data. Mallet is finicky in that it only accepts the cleanest data; it may have hit a null, punctuation, or a float.
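A quick way to hunt for that kind of row is to scan the tokenized documents before building the corpus. This is a minimal sketch, assuming your documents are lists of string tokens; `find_suspect_docs` is a hypothetical helper, and the checks are my guesses at what could upset the import step (the command in your traceback splits tokens on whitespace via `--token-regex "\S+"`). It does not look at punctuation.

```python
def find_suspect_docs(docs):
    """Yield (doc_index, reason) for documents that may trip up the import step."""
    for i, doc in enumerate(docs):
        if not isinstance(doc, (list, tuple)):
            # e.g. None or a float NaN where a list of tokens was expected
            yield i, f"not a token list: {doc!r}"
        elif len(doc) == 0:
            yield i, "empty document"
        else:
            for tok in doc:
                if not isinstance(tok, str):
                    yield i, f"non-string token: {tok!r}"
                    break
                if tok == "" or any(ch.isspace() for ch in tok):
                    yield i, f"empty or whitespace-containing token: {tok!r}"
                    break
```

Running `for idx, reason in find_suspect_docs(lists_of_words): print(idx, reason)` would list the rows worth inspecting by hand.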
Did you take a sample of your dictionary, or did you pass in the entire data set?
This is basically what it is doing: each sentence is split into words, the words are mapped to numbers, and the numbers are then counted for frequency, like:
[(3, 1), (13, 1), (37, 1)]
Word 3 ("assist") appears 1 time. Word 13 ("payment") appears 1 time. Word 37 ("account") appears 1 time.
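To make that concrete, here is a rough pure-Python sketch of the same idea. It is not gensim's implementation: `build_vocab` and `to_bow` are made-up helper names, and the ids they assign won't match the ones gensim's Dictionary produces.

```python
from collections import Counter


def build_vocab(docs):
    """Map each distinct word to an integer id, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for word in doc:
            vocab.setdefault(word, len(vocab))
    return vocab


def to_bow(doc, vocab):
    """Turn one tokenized document into sorted (word_id, count) pairs."""
    counts = Counter(vocab[word] for word in doc if word in vocab)
    return sorted(counts.items())


docs = [["assist", "payment", "account"], ["payment", "payment", "late"]]
vocab = build_vocab(docs)
bow = [to_bow(doc, vocab) for doc in docs]
```

Each entry of `bow` has the same shape as the list above: one `(word_id, count)` pair per distinct word in that document.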
Then your LDA looks at one word and scores it in terms of how frequently it occurs with all the other words in the dictionary, and it does that for the whole dictionary. So if you're letting it look at millions and millions of words, it's gonna crash real fast.
This is how I implemented mallet and shrunk my dictionary, not including stemming or other preprocessing steps:
import os
import gensim

# build a dictionary of all the words in the csv by iterating through the documents;
# it maps each word to an id and tracks how often each word appears in the training set
dictionary = gensim.corpora.Dictionary(processed_docs[:])
# preview the first few (id, word) pairs
count = 0
for k, v in dictionary.iteritems():
    print(k, v)
    count += 1
    if count > 10:
        break
# we want to throw out words that are so frequent that they tell us little about the topic (in more than 50% of rows)
# as well as words that are too infrequent (in fewer than 15 rows), then keep just the 100,000 most frequent words
dictionary.filter_extremes(no_below=15, no_above=0.5, keep_n=100000)
# the words become numbers and are then counted for frequency
# e.g. consider a random row, 4310: it has 27 words, and the word indexed 2 shows up 4 times
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]
# preview the bag of words for that row
bow_corpus[4310]
os.environ['MALLET_HOME'] = 'C:\\mallet\\mallet-2.0.8'
mallet_path = 'C:\\mallet\\mallet-2.0.8\\bin\\mallet'
ldamallet = gensim.models.wrappers.LdaMallet(mallet_path, corpus=bow_corpus, num_topics=20, alpha=.1,
                                             id2word=dictionary, iterations=1000, random_seed=569356958)
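For intuition about what the `filter_extremes` call above is doing to the vocabulary, here is a simplified pure-Python sketch of the same kind of filtering. `filter_vocab` is a made-up name, and this is an approximation of the behaviour as I understand it, not gensim's code: it counts how many documents each word appears in, drops the too-rare and too-common words, then keeps the most frequent survivors.

```python
from collections import Counter


def filter_vocab(docs, no_below=15, no_above=0.5, keep_n=100000):
    """Return the words that survive a document-frequency filter."""
    # document frequency: the number of documents each word appears in
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))
    # no_above is a fraction of the corpus; turn it into an absolute document count
    max_docs = int(no_above * len(docs))
    kept = [w for w, n in doc_freq.items() if no_below <= n <= max_docs]
    # most frequent first (ties broken alphabetically), then cap the vocabulary size
    kept.sort(key=lambda w: (-doc_freq[w], w))
    return kept[:keep_n]
```

On a toy corpus such as `[["a", "b"], ["a", "c"], ["a", "b"], ["a", "d"]]` with `no_below=2` and `no_above=0.5`, only `"b"` survives: `"a"` is in every document, and `"c"` and `"d"` are each in only one.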
Also, I would put your ldamallet call into a separate cell, as the training time is slow, especially on a data set that size. I hope this helped; let me know if you are still hitting errors :)
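If it does still fail, it might help to rerun the exact command from your traceback yourself so you can read what Mallet printed, since CalledProcessError only reports the exit status. A minimal sketch, assuming Python 3.7+; `run_and_capture` is a name I made up, and the demo at the bottom uses a stand-in command that fails on purpose rather than the real Mallet call.

```python
import subprocess
import sys


def run_and_capture(cmd):
    """Run cmd and return (exit_code, stdout, stderr) without raising on failure."""
    # A string is passed to the shell (as in the traceback); a list is run directly.
    result = subprocess.run(cmd, shell=isinstance(cmd, str),
                            capture_output=True, text=True)
    return result.returncode, result.stdout, result.stderr


if __name__ == "__main__":
    # Stand-in for the real command: a child process that writes to stderr and exits 1.
    demo_cmd = [sys.executable, "-c",
                "import sys; sys.stderr.write('boom'); sys.exit(1)"]
    code, out, err = run_and_capture(demo_cmd)
    print("exit status:", code)
    print("stderr:", err)
```

To use it on your case, pass the full `C:/mallet-2.0.8/bin/mallet import-file ...` string from the CalledProcessError message instead of `demo_cmd` and look at the stderr text.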
Upvotes: 1