Reputation: 1042
I am using Spark 2 to read data from HDFS and process it. To import my data from HDFS, I use the following:
JavaRDD<String> msg = spark.read().textFile("hdfs://myfolder/*").javaRDD();
But I wonder whether Spark will also read new text files created after the job has started.
If not, can you tell me how to do that?
Thanks in advance
Upvotes: 0
Views: 1790
Reputation: 46
Do you think the Streaming API will work for you? It can monitor a directory for new files and keep processing them as they arrive, if all the data is not available up front.
From http://spark.apache.org/docs/latest/streaming-programming-guide.html#input-dstreams-and-receivers
"For reading data from files on any file system compatible with the HDFS API (that is, HDFS, S3, NFS, etc.), a DStream can be created as: sample code
Spark Streaming will monitor the directory dataDirectory and process any files created in that directory (files written in nested directories not supported). Note that
The files must be created in the dataDirectory by atomically moving or renaming them into the data directory.
Once moved, the files must not be changed. So if the files are being continuously appended, the new data will not be read. "
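For the Java API used in your question, a minimal sketch could look like the following. It uses JavaStreamingContext.textFileStream, which is the Streaming API's built-in way to watch a directory for new text files. The directory URI is taken from your question, and the 30-second batch interval and the foreachRDD body are placeholders you would adapt to your own processing:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class HdfsDirectoryMonitor {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("MonitorHdfsFolder");
        // Poll the directory once per batch interval (30 seconds here; tune to your needs).
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(30));

        // textFileStream picks up new text files that are atomically moved
        // or renamed into the directory. The path comes from the question.
        JavaDStream<String> msg = jssc.textFileStream("hdfs://myfolder/");

        // Each batch contains only the lines of files that appeared during that interval.
        msg.foreachRDD((JavaRDD<String> rdd) -> {
            // Placeholder: replace with your actual processing logic.
            System.out.println("New lines in this batch: " + rdd.count());
        });

        jssc.start();              // begin monitoring the directory
        jssc.awaitTermination();   // block until the streaming job is stopped
    }
}

Note that, per the documentation quoted above, appending to an existing file will not be picked up; new data has to arrive as new files moved into the directory.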
Upvotes: 1