Python Gensim LDAMallet CalledProcessError with large corpus (runs fine with small corpus)

Question

I'm getting a CalledProcessError ("non-zero exit status 1") when I run the Gensim LdaMallet model on my full corpus of ~16 million documents. If I run the exact same code on a test corpus of ~160,000 documents, it runs without problems. Since it works on the small corpus, I'm inclined to think the code itself is correct, but I'm not sure what else could cause this error.

I've tried editing the mallet.bat file as suggested here, but to no avail. I've also double-checked the paths, but they shouldn't be the issue, given that the same code works with the smaller corpus.

import os
import gensim
from gensim import corpora

# lists_of_words: one list of tokens per document
id2word = corpora.Dictionary(lists_of_words)
corpus = [id2word.doc2bow(doc) for doc in lists_of_words]
num_topics = 30
os.environ.update({'MALLET_HOME': r'C:/mallet-2.0.8/'})
mallet_path = r'C:/mallet-2.0.8/bin/mallet'
ldamallet = gensim.models.wrappers.LdaMallet(mallet_path, corpus=corpus, num_topics=num_topics, id2word=id2word)

Here's the full traceback and error:

  File "", line 8, in 
    ldamallet = gensim.models.wrappers.LdaMallet(mallet_path, corpus=corpus, num_topics=num_topics, id2word=id2word)

  File "C:\ProgramData\Anaconda3\lib\site-packages\gensim\models\wrappers\ldamallet.py", line 132, in __init__
    self.train(corpus)

  File "C:\ProgramData\Anaconda3\lib\site-packages\gensim\models\wrappers\ldamallet.py", line 273, in train
    self.convert_input(corpus, infer=False)

  File "C:\ProgramData\Anaconda3\lib\site-packages\gensim\models\wrappers\ldamallet.py", line 262, in convert_input
    check_output(args=cmd, shell=True)

  File "C:\ProgramData\Anaconda3\lib\site-packages\gensim\utils.py", line 1918, in check_output
    raise error

CalledProcessError: Command 'C:/mallet-2.0.8/bin/mallet import-file --preserve-case --keep-sequence --remove-stopwords --token-regex "\S+" --input C:\Users\user\AppData\Local\Temp\2\e1ba4a_corpus.txt --output C:\Users\user\AppData\Local\Temp\2\e1ba4a_corpus.mallet' returned non-zero exit status 1.
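
The traceback above shows only the exit status, not whatever Mallet itself printed before it failed. One way to narrow this down is to re-run the failing command outside of Gensim and capture its output. The following is a minimal sketch, not part of Gensim: `run_and_capture` is a hypothetical helper name, and the demo command is a stand-in that simply fails on purpose.

```python
import subprocess
import sys


def run_and_capture(cmd, shell=False):
    """Run cmd and return (exit_code, stdout, stderr) as text, without raising.

    Hypothetical helper for diagnosis only; it is not a Gensim or Mallet API.
    """
    result = subprocess.run(cmd, shell=shell, capture_output=True, text=True)
    return result.returncode, result.stdout, result.stderr


if __name__ == "__main__":
    # Stand-in command that writes to stderr and exits with status 1.
    # For a real diagnosis, pass the command string from the traceback
    # instead, with shell=True.
    demo_cmd = [sys.executable, "-c",
                "import sys; sys.stderr.write('boom'); sys.exit(1)"]
    code, out, err = run_and_capture(demo_cmd)
    print("exit status:", code)
    print("stderr:", err)
```

To use it on the real failure, pass the exact command string from the CalledProcessError message with `shell=True`, assuming the temporary corpus file named in that command still exists. Whatever appears in the captured stderr should say more about the cause than the exit status alone.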

Answers (1)