Reputation: 774
I'm working on Spark Streaming and I want to set up a local directory to stream data into my Spark application, so that every new text file placed in that directory is streamed to the application. I tried using StreamingContext's textFileStream method, but I haven't received any data from the files I've moved into the specified local directory. Could you help me figure out why this is happening?
So here is the code I've written:
import sys

from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext


def main():
    if len(sys.argv) != 5:
        print 'Usage: SPARK_HOME/bin/spark-submit CoinpipeVectorBuilder.py <SPARK_HOME> <dir_streaming> ' \
              '<dir_crawled_addresses> <dir_output_vectors>'
        sys.exit(1)

    # Set the path to crawled outputs according to the parameter passed to the spark script
    global path_crawled_output
    path_crawled_output = sys.argv[4]

    global sc, ssc
    sconf = SparkConf().setMaster("local[2]")\
        .setAppName("CoinPipeVectorBuilder")\
        .set("spark.hadoop.validateOutputSpecs", "false")
    sc = SparkContext(conf=sconf)
    ssc = StreamingContext(sc, 10)

    tx_and_addr_stream = ssc.textFileStream(sys.argv[2])
    # parseAndBuildVectors is defined elsewhere in my script
    tx_and_addr_stream.foreachRDD(parseAndBuildVectors)

    ssc.start()
    ssc.awaitTermination()

if __name__ == "__main__":
    main()
So inside parseAndBuildVectors I get no data, even when I move a new file into the directory I passed to ssc.textFileStream.
Upvotes: 2
Views: 1134
Reputation: 751
Spark code executes on the worker nodes, and the workers have no access to your local file system, so this is not possible directly. Spark can only access data that is visible to all of its workers, i.e. distributed storage. Put the files somewhere every worker can read (for example HDFS), stream them from there into RDDs, and then perform your operations with Spark.
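As a rough illustration, here is a minimal sketch of the same kind of setup pointed at a directory on distributed storage instead of a local path. The hdfs://... URL, directory name, and the process function are placeholders I made up for this example, not something from your post:

from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext


def process(rdd):
    # Placeholder per-batch processing; in your case this would be
    # your own parseAndBuildVectors function.
    print rdd.take(5)


sconf = SparkConf().setMaster("local[2]").setAppName("HdfsTextFileStream")
sc = SparkContext(conf=sconf)
ssc = StreamingContext(sc, 10)

# Monitor a directory that every worker can reach, e.g. on HDFS.
# Files must appear in it (moved in atomically) after the context starts,
# or textFileStream will not pick them up.
stream = ssc.textFileStream("hdfs://namenode:8020/user/me/streaming_input")
stream.foreachRDD(process)

ssc.start()
ssc.awaitTermination()

The key point is only the path argument to textFileStream: it should be a location that the whole cluster can read, not a directory that exists only on your local machine.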
Upvotes: 1