Sachin Sukumaran

Reputation: 715

How to open a file which is stored in HDFS in pySpark using with open

How do I open a file that is stored in HDFS? The input file below is on HDFS. If I pass the path as shown, I can't open it; it fails with "file not found":

from pyspark import SparkConf, SparkContext

conf = SparkConf()
sc = SparkContext(conf=conf)

def getMovieName():
    movieNames = {}
    with open("/user/sachinkerala6174/inData/movieStat") as f:
        for line in f:
            fields = line.split("|")
            movieNames[int(fields[0])] = fields[1]
    return movieNames

nameDict = sc.broadcast(getMovieName())

My assumption was to use something like

with open (sc.textFile("/user/sachinkerala6174/inData/movieStat")) as f:

But that didn't work either.

Upvotes: 4

Views: 12207

Answers (1)

Yaron

Reputation: 10450

To read the text file into an RDD:

rdd_name = sc.textFile("/user/sachinkerala6174/inData/movieStat")

You can use collect() to bring the data into pure Python (not recommended; use it only on very small data), or manipulate the RDD with Spark's RDD methods in pyspark (the recommended way).
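As a sketch of the recommended way, the question's dictionary can be built with RDD methods instead of Python's `open()`, which only reads local paths. This assumes the same pipe-separated `id|name` layout as the question; `parse_movie` is a hypothetical helper name:

```python
# Hypothetical sketch: parse one "movieID|movieName|..." line into a
# (int_id, name) pair, so the RDD can be turned into a lookup dict.
def parse_movie(line):
    fields = line.split("|")   # assumes pipe-separated fields
    return (int(fields[0]), fields[1])

# With a live SparkContext `sc` (not created here), the lookup table
# could then be built and broadcast like this:
# rdd = sc.textFile("/user/sachinkerala6174/inData/movieStat")
# movieNames = rdd.map(parse_movie).collectAsMap()  # only safe for small tables
# nameDict = sc.broadcast(movieNames)
```

`collectAsMap()` pulls the whole RDD to the driver, so like `collect()` it is only appropriate when the lookup table is small enough to fit in driver memory.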

More info from the pyspark API docs:

textFile(name, minPartitions=None, use_unicode=True)

Read a text file from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI, and return it as an RDD of Strings.

If use_unicode is False, the strings will be kept as str (encoding as utf-8), which is faster and smaller than unicode. (Added in Spark 1.2)

>>> path = os.path.join(tempdir, "sample-text.txt")
>>> with open(path, "w") as testFile:
...    _ = testFile.write("Hello world!")
>>> textFile = sc.textFile(path)
>>> textFile.collect()
[u'Hello world!']

Upvotes: 3
