Reputation: 715
How do I open a file that is stored in HDFS? Here the input file is on HDFS. If I give the file as below, I won't be able to open it, and it will show "file not found".
from pyspark import SparkConf, SparkContext

conf = SparkConf()
sc = SparkContext(conf=conf)

def getMovieName():
    movieNames = {}
    # open() reads from the local file system, so this HDFS path is not found
    with open("/user/sachinkerala6174/inData/movieStat") as f:
        for line in f:
            fields = line.split("|")
            mID = fields[0]
            mName = fields[1]
            movieNames[int(mID)] = mName
    return movieNames

nameDict = sc.broadcast(getMovieName())
My assumption was to use something like:

with open(sc.textFile("/user/sachinkerala6174/inData/movieStat")) as f:

But that didn't work either.
Upvotes: 4
Views: 12207
Reputation: 10450
To read the text file into an RDD:
rdd_name = sc.textFile("/user/sachinkerala6174/inData/movieStat")
You can use collect() to work with it in pure Python (not recommended; use it only on very small data), or use Spark RDD methods to manipulate it with PySpark (the recommended way).
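For example, the getMovieName() from the question can be rewritten on top of textFile. This is a minimal sketch: it keeps the path and "|" delimiter from the question, and it assumes the movie file is small enough for the parsed pairs to be collected back to the driver:

from pyspark import SparkConf, SparkContext

conf = SparkConf()
sc = SparkContext(conf=conf)

def getMovieName():
    # Read through Spark instead of Python's open(), so HDFS paths resolve
    lines = sc.textFile("/user/sachinkerala6174/inData/movieStat")
    # Split each "|"-delimited line into a (movieID, movieName) pair
    pairs = lines.map(lambda line: line.split("|")) \
                 .map(lambda fields: (int(fields[0]), fields[1]))
    # collectAsMap() brings the pairs back to the driver as a plain dict;
    # safe here only because the lookup table is small
    return pairs.collectAsMap()

nameDict = sc.broadcast(getMovieName())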
More info from the PySpark API documentation:
textFile(name, minPartitions=None, use_unicode=True)
Read a text file from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI, and return it as an RDD of Strings.
If use_unicode is False, the strings will be kept as str (encoding as utf-8), which is faster and smaller than unicode. (Added in Spark 1.2)
>>> path = os.path.join(tempdir, "sample-text.txt")
>>> with open(path, "w") as testFile:
...    _ = testFile.write("Hello world!")
>>> textFile = sc.textFile(path)
>>> textFile.collect()
[u'Hello world!']
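Since textFile accepts any Hadoop-supported file system URI, you can also pass a fully qualified URI to make it explicit which file system is read. The host and port below are placeholders for your cluster's namenode:

# Hypothetical namenode host/port -- substitute your cluster's values
rdd = sc.textFile("hdfs://namenode:8020/user/sachinkerala6174/inData/movieStat")
# A file:// URI forces the local file system instead
local_rdd = sc.textFile("file:///tmp/movieStat")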
Upvotes: 3