Saulo Ricci

Reputation: 774

Apache Spark Streaming is Not Reading the Directory

I'm working on Spark Streaming and I want to watch a local directory from my Spark application, so that every new text file in that directory gets streamed into the application. I tried using StreamingContext's textFileStream method, but I haven't received any data from the files I moved into the specified local directory. Could you help me figure out why this is happening?

So here is the code I've written:

import sys

from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext

def main():
    if len(sys.argv) != 5:
        print 'Usage: SPARK_HOME/bin/spark-submit CoinpipeVectorBuilder.py <SPARK_HOME> <dir_streaming> ' \
              '<dir_crawled_addresses> <dir_output_vectors>'
        sys.exit(1)

    # Set the path to crawled outputs according to the parameter passed to the spark script
    global path_crawled_output
    path_crawled_output = sys.argv[4]

    global sc, ssc
    sconf = SparkConf().setMaster("local[2]")\
        .setAppName("CoinPipeVectorBuilder")\
        .set("spark.hadoop.validateOutputSpecs", "false")
    sc = SparkContext(conf=sconf)
    ssc = StreamingContext(sc, 10)  # 10-second batch interval
    tx_and_addr_stream = ssc.textFileStream(sys.argv[2])

    # parseAndBuildVectors is defined elsewhere in my script
    tx_and_addr_stream.foreachRDD(parseAndBuildVectors)

    ssc.start()
    ssc.awaitTermination()

if __name__ == "__main__":
    main()

Inside parseAndBuildVectors I get no data, even when I move a new file into the directory I passed to ssc.textFileStream.

Upvotes: 2

Views: 1134

Answers (1)

Tinku

Reputation: 751

Spark code executes on the workers, and the workers have no access to your local file system, so this is not possible directly. You can read the file yourself, build an RDD from its contents, and then perform your operations with Spark. Spark can only access distributed data.
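A minimal sketch of that workaround, assuming a hypothetical local directory /tmp/coinpipe_input (not from the question): the driver reads each text file itself, parallelizes the lines into RDDs, and feeds them to the streaming context with queueStream, which emits one queued RDD per batch interval.

    import glob

    from pyspark import SparkConf, SparkContext
    from pyspark.streaming import StreamingContext

    sconf = SparkConf().setMaster("local[2]").setAppName("QueueStreamSketch")
    sc = SparkContext(conf=sconf)
    ssc = StreamingContext(sc, 10)

    # Read each local file on the driver (which can see the local disk),
    # then turn the lines into RDDs that the workers can process.
    rdd_queue = []
    for path in glob.glob("/tmp/coinpipe_input/*.txt"):
        with open(path) as f:
            rdd_queue.append(sc.parallelize(f.read().splitlines()))

    # Each batch interval pulls one RDD from the queue.
    lines = ssc.queueStream(rdd_queue)
    lines.pprint()

    ssc.start()
    ssc.awaitTermination()

You could hang your own processing (e.g. a foreachRDD callback like the question's parseAndBuildVectors) off the queueStream DStream in place of pprint.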

Upvotes: 1
