nidhink1995

Reputation: 11

Reading files from HDFS directory and creating a RDD in Spark using Python

I have some text files and I want to create an RDD from them. The text files are stored in 'Folder_1' and 'Folder_2', and these folders are stored in the folder 'text_data'.

When the files are stored in local storage, the following code works:

# Reading the corpus as an RDD
import os

data_folder = '/home/user/text_data'

def read_data(data_folder):
    data = sc.parallelize([])
    for folder in os.listdir(data_folder):
        for txt_file in os.listdir(os.path.join(data_folder, folder)):
            temp = open(os.path.join(data_folder, folder, txt_file))
            temp_da = temp.read()
            temp_da = unicode(temp_da, errors='ignore')  # Python 2: decode, skipping bad bytes
            temp.close()
            a = [(folder, temp_da)]
            data = data.union(sc.parallelize(a))
    return data

The function read_data returns an RDD of (folder, file_contents) pairs.
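For example (the output shown here is illustrative):

data = read_data(data_folder)
data.take(1)  # e.g. [('Folder_1', u'contents of the first file...')]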

How can I perform the above function if I move the 'text_data' folder to an HDFS directory?

The code is to be deployed on a Hadoop/YARN cluster running Spark.

Upvotes: 0

Views: 15289

Answers (1)

sasubillis

Reputation: 76

Replace <namenode> below with the namenode of your Hadoop environment:

hdfs_folder = 'hdfs://<namenode>/home/user/text_data/*'

def read_data(hdfs_folder):
    # textFile reads every file matched by the glob; each RDD element is one line of text
    return sc.textFile(hdfs_folder)

This was tested on Spark 1.6.2:

>>> hdfs_folder = 'hdfs://coord-1/tmp/sparktest/0.txt'
>>> def read_data(hdfs_folder):
...     data = sc.parallelize([])
...     data = sc.textFile(hdfs_folder)
...     return data
...
>>> read_data(hdfs_folder).count()
17/03/15 00:30:57 INFO SparkContext: Created broadcast 14 from textFile at NativeMethodAccessorImpl.java:-2
17/03/15 00:30:57 INFO SparkContext: Starting job: count at <stdin>:1
17/03/15 00:30:57 INFO SparkContext: Created broadcast 15 from broadcast at DAGScheduler.scala:1012
189
>>>
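Note that textFile flattens everything into an RDD of lines, so the folder labels produced by the original read_data are lost. If you need (folder, contents) pairs, sc.wholeTextFiles may be a better fit; it returns one (file_path, file_content) pair per file. A minimal sketch, assuming the same text_data/Folder_*/file layout as in the question:

def read_data_with_labels(hdfs_folder):
    # one (path, contents) pair per file under text_data/<folder>/
    pairs = sc.wholeTextFiles(hdfs_folder + '/*/*')
    # keep only the parent folder name as the label
    return pairs.map(lambda kv: (kv[0].rsplit('/', 2)[1], kv[1]))

labelled = read_data_with_labels('hdfs://<namenode>/home/user/text_data')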

Upvotes: 1
