Arman Malkhasyan

Reputation: 97

nltk.download('wordnet') in Dataproc

When I run the following script in Dataproc:

import nltk
nltk.download('wordnet')

nltk_data is downloaded only on the master node, not on the worker nodes. As a result, when I submit a PySpark job to Dataproc, it fails because the worker nodes cannot read the data.

What solutions do you suggest? How can I download nltk_data on the worker nodes too?

Upvotes: 1

Views: 134

Answers (1)

Igor Dvorzhak

Reputation: 4455

You can use initialization actions (init actions) to run the download on every cluster node, workers included: https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/init-actions
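
As a rough sketch of what such an init action could run on each node, the helper below downloads the corpora into a system-wide directory instead of the home directory of whoever runs the script. The function name `download_corpora`, the `downloader` parameter, and the choice of `/usr/share/nltk_data` are assumptions made for illustration, not part of the Dataproc or NLTK documentation for this use case. It also assumes nltk is already installed on every node.

```python
import os

# Assumption: /usr/share/nltk_data is one of the directories NLTK searches
# by default on Linux, so data placed here is visible to any user's job.
NLTK_DATA_DIR = "/usr/share/nltk_data"


def download_corpora(packages, target_dir=NLTK_DATA_DIR, downloader=None):
    """Download each NLTK package into target_dir; return the names that failed.

    `downloader` is a hypothetical hook (defaults to nltk.download) so the
    function can be exercised without network access.
    """
    if downloader is None:
        import nltk  # imported lazily; must be installed on the node
        downloader = nltk.download

    os.makedirs(target_dir, exist_ok=True)

    failed = []
    for name in packages:
        # nltk.download returns True on success and False otherwise.
        if not downloader(name, download_dir=target_dir):
            failed.append(name)
    return failed
```

A script used as the init action would end with a call such as `download_corpora(["wordnet"])` and exit with a non-zero status if the returned list is not empty. The script itself would be uploaded to a Cloud Storage bucket and passed when the cluster is created, so an existing cluster would likely need to be recreated for it to take effect.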

Upvotes: 1
