Jean

Reputation: 429

Spark Streaming textFileStream and _COPYING_ files

I'm trying to monitor a directory in HDFS and to read and process the data in files copied to it (to copy files from the local file system to HDFS I use hdfs dfs -put). Sometimes this produces the error: Spark Streaming: java.io.FileNotFoundException: File does not exist: <input_filename>._COPYING_

So I read about the problem in forums and in the question here: Spark Streaming: java.io.FileNotFoundException: File does not exist: <input_filename>._COPYING_. According to what I read, the problem is that Spark Streaming picks the file up before it has finished being copied into HDFS. On GitHub (https://github.com/maji2014/spark/blob/b5af1bdc3e35c53564926dcbc5c06217884598bb/streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala) they say the problem was corrected, but as far as I can tell only in FileInputDStream, whereas I'm using textFileStream. When I tried to use FileInputDStream directly, the IDE threw an error: the symbol is not accessible from this place.

Does anyone know how to filter out the files that are still being copied? I tried:

var lines = ssc.textFileStream(arg(0)).filter(!_.contains("_COPYING_"))

but that didn't work, which is expected, because the filter is applied to the lines of the files being processed rather than to the file names, which I can't access from there. As you can see, I did plenty of research before asking this question, but didn't get lucky. Any help please?
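For what it's worth, what I think I'm really after is a filter on the file path rather than on the file contents. From the StreamingContext API it looks like fileStream has an overload that takes a Path => Boolean filter, so something roughly like the sketch below might skip files that are still being copied (untested on my side; the ._COPYING_ suffix check is just my assumption about how hdfs dfs -put names the temporary file):

import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// fileStream exposes the file path, so files that -put has not finished
// writing (they end in ._COPYING_) can be rejected before they are read.
val lines = ssc.fileStream[LongWritable, Text, TextInputFormat](
  arg(0),
  (path: Path) => !path.getName.endsWith("._COPYING_"),
  true // newFilesOnly
).map(_._2.toString) // keep only the text content, like textFileStream does

But I haven't managed to confirm this works, so any pointers are welcome.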

Upvotes: 1

Views: 1418

Answers (1)

Vale

Reputation: 1124

So I had a look: -put is the wrong method here. Look at the final comment on the question you linked: you have to copy the file in under a temporary name or directory and then rename it (hdfs dfs -mv) from your shell script, so that it shows up in the monitored directory as an atomic transaction on HDFS.
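In practice (paths below are just examples) that means putting the file somewhere Spark is not watching first, e.g. hdfs dfs -put localfile.txt /staging/, and then hdfs dfs -mv /staging/localfile.txt /watched/ from the script; within the same HDFS the move is a metadata-only rename and therefore atomic. If the ingest is driven from Scala rather than a shell script, a rough equivalent with the Hadoop FileSystem API would be:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Sketch only: stage the file outside the watched directory, then rename it in.
// The rename is atomic on HDFS, so the streaming job never lists a half-copied file.
val fs = FileSystem.get(new Configuration())
fs.copyFromLocalFile(new Path("/local/input.txt"), new Path("/staging/input.txt"))
fs.rename(new Path("/staging/input.txt"), new Path("/watched/input.txt"))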

Upvotes: 2
