I ball was rawt

Reputation: 61

Python Spacy KeyError: "[E018] Can't retrieve string for hash

I am trying to get my code running on a Raspberry Pi 4 and have been stuck on this error for hours. The following code segment throws an error on the Pi but runs perfectly on Windows with the same project:

import os

import gensim
import spacy
from gensim.utils import simple_preprocess

def create_lem_texts(data):  # data: a list of raw texts
    def sent_to_words(sentences):
        for sentence in sentences:
            # deacc=True removes punctuation
            yield gensim.utils.simple_preprocess(str(sentence), deacc=True)

    data_words = list(sent_to_words(data))
    bigram = gensim.models.Phrases(data_words, min_count=5, threshold=100)  # higher threshold, fewer phrases
    bigram_mod = gensim.models.phrases.Phraser(bigram)

    def remove_stopwords(texts):
        # stop_words is defined elsewhere in the project
        return [[word for word in simple_preprocess(str(doc)) if word not in stop_words]
                for doc in texts]

    def make_bigrams(texts):
        return [bigram_mod[doc] for doc in texts]

    def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
        """https://spacy.io/api/annotation"""
        texts_out = []
        print(os.getcwd())
        for sent in texts:
            doc = nlp(" ".join(sent))
            texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
        return texts_out

    data_words_nostops = remove_stopwords(data_words)
    data_words_bigrams = make_bigrams(data_words_nostops)
    print(os.getcwd())
    # nlp is loaded here, before lemmatization() is actually called
    nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

    # Do lemmatization, keeping only nouns, adjectives, verbs and adverbs
    data_lemmatized = lemmatization(data_words_bigrams, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])

    return data_lemmatized

This code is in turn called by this function:

from gensim.corpora import Dictionary
from gensim.models import LdaModel

def assign_topics_tweet(tweets):
    owd = os.getcwd()
    print(owd)
    os.chdir('/home/pi/Documents/pycharm_project_twitter/topic_model/')
    print(os.getcwd())
    lda = LdaModel.load("LDA26")
    print(lda)
    id2word = Dictionary.load('Id2Word')
    print(id2word)
    os.chdir(owd)
    data = create_lem_texts(tweets)
    corpus = [id2word.doc2bow(text) for text in data]
    topics = []
    for tweet in corpus:
        topics_dist = lda.get_document_topics(tweet)
        topics.append(topics_dist)
    return topics
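
For reference, a hypothetical call would look like this (the example strings are made up; tweets is just a list of raw tweet texts):

tweets = ["first example tweet", "second example tweet"]
topics = assign_topics_tweet(tweets)
print(topics)  # one list of (topic_id, probability) pairs per tweet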

And here is the error message:

Traceback (most recent call last):
  File "/home/pi/Documents/pycharm_project_twitter/Twitter_Import.py", line 193, in <module>
    main()
  File "/home/pi/Documents/pycharm_project_twitter/Twitter_Import.py", line 169, in main
    topics = assign_topics_tweet(data)
  File "/home/pi/Documents/pycharm_project_twitter/TopicModel.py", line 238, in assign_topics_tweet
    data = create_lem_texts(tweets)
  File "/home/pi/Documents/pycharm_project_twitter/TopicModel.py", line 76, in create_lem_texts
    data_lemmatized = lemmatization(data_words_bigrams, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])
  File "/home/pi/Documents/pycharm_project_twitter/TopicModel.py", line 67, in lemmatization
    texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
  File "/home/pi/Documents/pycharm_project_twitter/TopicModel.py", line 67, in <listcomp>
    texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
  File "token.pyx", line 871, in spacy.tokens.token.Token.lemma_.__get__
  File "strings.pyx", line 136, in spacy.strings.StringStore.__getitem__
KeyError: "[E018] Can't retrieve string for hash '18446744073541552667'. This usually refers to an issue with the `Vocab` or `StringStore`."

Process finished with exit code 1

I tried reinstalling spaCy and the en model, and running the script directly on the Pi; the spaCy versions are the same on my Windows machine and on the Pi. There is basically no information online about this error.
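
For reference, a quick way to compare the two environments is to print both the spaCy version and the loaded model's metadata on each machine (a minimal check; the meta fields assume a standard model package):

import spacy

print(spacy.__version__)              # spaCy runtime version
nlp = spacy.load('en_core_web_sm')
print(nlp.meta.get('version'))        # model package version
print(nlp.meta.get('spacy_version'))  # spaCy range the model was built for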

Upvotes: 3

Views: 1250

Answers (1)

I ball was rawt

Reputation: 61

After three days of testing, the problem was solved by simply installing an older version of spaCy, 2.0.1.
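
Assuming a pip-based install on the Pi, pinning the version would look something like this (the exact model download command can differ between spaCy releases):

pip3 install spacy==2.0.1
python3 -m spacy download en_core_web_sm

Re-downloading the model after downgrading is likely needed as well: the E018 message itself points at the Vocab/StringStore, which typically means the installed model and the spaCy runtime don't match.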

Upvotes: 2
