Reputation: 1769
I copied a simple Python script from "Building a Wikipedia Text Corpus for Natural Language Processing". It uses gensim to build the corpus by stripping all Wikipedia markup from the articles. This is the code:
"""
Creates a corpus from Wikipedia dump file.
Inspired by:
https://github.com/panyang/Wikipedia_Word2vec/blob/master/v1/process_wiki.py
"""
import sys

from gensim.corpora import WikiCorpus


def make_corpus(in_f, out_f):
    """Convert Wikipedia xml dump file to text corpus"""
    wiki = WikiCorpus(in_f)
    # Write one article per line. utf-8 is set explicitly because the
    # platform default encoding (e.g. on Windows) may not cover every character.
    with open(out_f, 'w', encoding='utf-8') as output:
        for i, text in enumerate(wiki.get_texts(), start=1):
            output.write(' '.join(text) + '\n')
            if i % 10000 == 0:
                print('Processed ' + str(i) + ' articles')
    print('Processing complete!')


if __name__ == '__main__':
    if len(sys.argv) != 3:
        print('Usage: python make_wiki_corpus.py <wikipedia_dump_file> <processed_text_file>')
        sys.exit(1)
    in_f = sys.argv[1]
    out_f = sys.argv[2]
    make_corpus(in_f, out_f)
When I run it, I get the error:
ModuleNotFoundError: No module named 'gensim'
although I have installed the gensim package with:
python3 -m pip install gensim
EDIT: If I try
pip install -U gensim
instead, I get the error:
ImportError: cannot import name 'SourceDistribution' from 'pip._internal.distributions.source' (C:\Users\Standard\Anaconda3\lib\site-packages\pip\_internal\distributions\source\__init__.py)
Upvotes: 0
Views: 385
Reputation: 6017
You do not have the gensim module installed on your system. Install or upgrade it with:
pip install -U gensim
Or download it from https://pypi.python.org/pypi/gensim.
gensim depends on scipy and numpy. You must have them installed prior to installing gensim.
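A quick way to see what is actually installed is to ask the interpreter that runs your script. The snippet below is only a diagnostic sketch, not part of gensim: `module_available` is a hypothetical helper name, and it relies solely on the standard library (`importlib.util.find_spec`). If it reports gensim as missing, the `pip` you used probably belongs to a different Python installation than the one printed on the first line.

```python
import importlib.util
import sys


def module_available(name):
    """Return True if `name` can be found by the interpreter running this code."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    # Shows which Python installation is executing the script.
    print("Interpreter:", sys.executable)
    for name in ("numpy", "scipy", "gensim"):
        status = "found" if module_available(name) else "MISSING"
        print(name + ": " + status)
```

Run it with the same command you use for your script (for example `python3`), so that the same interpreter is checked.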
There is a bug in pip 20.0.0. Either upgrade to 20.0.1 using:
python get-pip.py
Or downgrade to 19.3.1 using:
python get-pip.py pip==19.3.1
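To check whether your interpreter has the affected pip release, you can read the installed version without importing pip itself. This is a rough sketch under two assumptions: Python 3.8 or newer (for `importlib.metadata`), and that only release 20.0 / 20.0.0 is affected, as stated above. `normalize` and `is_buggy_pip` are hypothetical helper names.

```python
from importlib import metadata


def normalize(version):
    """Turn a version string such as '20.0' or '19.3.1' into a 3-tuple of ints."""
    parts = []
    for piece in version.split(".")[:3]:
        if not piece.isdigit():
            break  # stop at suffixes this simple parser does not understand
        parts.append(int(piece))
    while len(parts) < 3:
        parts.append(0)  # pad '20.0' to (20, 0, 0)
    return tuple(parts)


def is_buggy_pip(version):
    """Assumption from the answer: only pip 20.0.0 has the bug."""
    return normalize(version) == (20, 0, 0)


if __name__ == "__main__":
    try:
        installed = metadata.version("pip")
    except metadata.PackageNotFoundError:
        print("No pip metadata found for this interpreter")
    else:
        print("pip", installed, "affected" if is_buggy_pip(installed) else "not affected")
```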
Upvotes: 1