Reputation: 111
I have a confusing problem. I want to distribute an HDFS file to all Spark workers. The code is as follows:
import sys
import os
from pyspark.ml.feature import Word2Vec
from pyspark import SparkConf, SparkContext
from pyspark.sql import Row
import jieba.posseg as posseg
import jieba
if __name__ == "__main__":
    reload(sys)
    sys.setdefaultencoding('utf-8')
    conf = SparkConf().setAppName('fenci_0')
    sc = SparkContext(conf=conf)
    date = '20180801'
    scatelist = ['95']
    # I want to distribute an HDFS file to all Spark workers
    hdfs_file_path = '/home/a/part-00000'
    sc.addFile(hdfs_file_path)
    ...
    ...
But it throws an error like "java.io.FileNotFoundException: Added file file does not exist".
I can access hdfs_file_path and read the file's content, so why does this happen? My guess is that when adding an HDFS file, sc.addFile may require a scheme prefix, something like sc.addFile('hdfs://...')?
I have searched Google and Stack Overflow, but maybe the keywords I used were not right. Could you help me find the error? Thank you.
Upvotes: 2
Views: 1154
Reputation: 1531
Yes.
You need to give the full HDFS path, maybe something like below:
sc.addFile('hdfs://<reference_to_name_node_or_name_service_ID>/home/a/part-00000')
This is because the sc.addFile() method accepts files from any supported filesystem: a local file, a file in HDFS or another Hadoop-supported filesystem, or an HTTP/HTTPS/FTP URI. When the path has no scheme, Spark treats it as a local file on the driver, which is why it reports that the file does not exist.
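For completeness, here is a minimal sketch of the full round trip. The name-node address namenode:8020, the app name, and the one-partition RDD are placeholders for illustration; workers read their local copy of the distributed file via SparkFiles.get():

from pyspark import SparkConf, SparkContext, SparkFiles

conf = SparkConf().setAppName('addfile_demo')
sc = SparkContext(conf=conf)

# Substitute your cluster's name node (or name-service ID);
# this host:port is a placeholder.
sc.addFile('hdfs://namenode:8020/home/a/part-00000')

def read_first_line(_):
    # SparkFiles.get() resolves the worker-local copy of the added file by name.
    with open(SparkFiles.get('part-00000')) as f:
        return [f.readline().strip()]

# Run the function on one executor and bring the result back to the driver.
print(sc.parallelize([0], 1).flatMap(read_first_line).collect())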
Upvotes: 3