Saulo Ricci

Reputation: 774

Apache Spark Streaming is Not Reading the Directory

I'm working on Spark Streaming and I want to watch a local directory from my Spark application, so that every new text file in that directory gets streamed into the application. I tried using StreamingContext's textFileStream method, but I haven't received any data from the files I moved into the specified local directory. Could you help me figure out why this is happening?

So here is the code I've written:

import sys

from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext

def main():
    if len(sys.argv) != 5:
        print 'Usage: SPARK_HOME/bin/spark-submit CoinpipeVectorBuilder.py <SPARK_HOME> <dir_streaming> ' \
              '<dir_crawled_addresses> <dir_output_vectors>'
        sys.exit(1)

    # Set the path to crawled outputs according to the parameter passed to the spark script
    global path_crawled_output
    path_crawled_output = sys.argv[4]

    global sc, ssc
    sconf = SparkConf().setMaster("local[2]")\
        .setAppName("CoinPipeVectorBuilder")\
        .set("spark.hadoop.validateOutputSpecs", "false")
    sc = SparkContext(conf=sconf)
    ssc = StreamingContext(sc, 10)  # 10-second batch interval
    tx_and_addr_stream = ssc.textFileStream(sys.argv[2])

    # parseAndBuildVectors is defined elsewhere in my script
    tx_and_addr_stream.foreachRDD(parseAndBuildVectors)

    ssc.start()
    ssc.awaitTermination()

if __name__ == "__main__":
    main()

Inside parseAndBuildVectors I get no data, even when I move a new file into the directory I passed to ssc.textFileStream.

Upvotes: 2

Views: 1134

Answers (1)

Tinku

Reputation: 751

Spark code executes on the workers, and the workers have no access to your local file system, so this is not possible directly. You can read the file yourself, build an RDD from its contents, and then perform your operations with Spark. Spark can only access distributed data.
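A minimal sketch of that workaround, assuming a hypothetical local directory /tmp/coinpipe_input (not from the question): the driver reads each text file itself, parallelizes the lines into RDDs, and feeds them to the streaming context with queueStream, which emits one queued RDD per batch interval.

    import glob

    from pyspark import SparkConf, SparkContext
    from pyspark.streaming import StreamingContext

    sconf = SparkConf().setMaster("local[2]").setAppName("QueueStreamSketch")
    sc = SparkContext(conf=sconf)
    ssc = StreamingContext(sc, 10)

    # Read each local file on the driver (which can see the local disk),
    # then turn the lines into RDDs that the workers can process.
    rdd_queue = []
    for path in glob.glob("/tmp/coinpipe_input/*.txt"):
        with open(path) as f:
            rdd_queue.append(sc.parallelize(f.read().splitlines()))

    # Each batch interval pulls one RDD from the queue.
    lines = ssc.queueStream(rdd_queue)
    lines.pprint()

    ssc.start()
    ssc.awaitTermination()

You could hang your own processing (e.g. a foreachRDD callback like the question's parseAndBuildVectors) off the queueStream DStream in place of pprint.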

Upvotes: 1
