Reputation: 97
When I run the following script on a Dataproc cluster:
import nltk
nltk.download('wordnet')
the nltk_data is downloaded only on the master node, not on the worker nodes. As a result, when I submit a PySpark job to Dataproc, it fails to read the data on the worker nodes.
What solutions do you suggest? How can I download nltk_data on the worker nodes too?
Upvotes: 1
Views: 134
Reputation: 4455
You can use init actions to do this on all cluster nodes: https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/init-actions
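A minimal sketch of such an init action, assuming the default python/pip on the nodes is the interpreter PySpark uses (on some images you may need pip3 or the conda pip instead); the script name, bucket, cluster name, and region below are placeholders:

#!/bin/bash
# install-nltk.sh -- init action; Dataproc runs it on every node (master and workers)
set -euxo pipefail

# Install NLTK for the interpreter that PySpark will use
pip install nltk

# Download wordnet into a system-wide directory that NLTK searches by default
python -c "import nltk; nltk.download('wordnet', download_dir='/usr/share/nltk_data')"

Stage the script in Cloud Storage and pass it when creating the cluster:

gsutil cp install-nltk.sh gs://your-bucket/install-nltk.sh
gcloud dataproc clusters create your-cluster \
    --region=us-central1 \
    --initialization-actions=gs://your-bucket/install-nltk.sh

Note that init actions only run at cluster creation time, so for an existing cluster you would need to recreate it (or run the same commands manually on each node).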
Upvotes: 1