Reputation: 1042
I am using Spark 2 to read data from HDFS and process it. To import my data from HDFS, I use the following:
JavaRDD<String> msg = spark.read().textFile("hdfs://myfolder/*").javaRDD();
But I wonder whether Spark will also read new text files created after the job has started.
If not, can you tell me how to do that?
Thanks in advance
Upvotes: 0
Views: 1790
Reputation: 46
Do you think the Streaming API will work for you? It can monitor a directory for new files and keep processing them as they arrive, if all the data is not available up front.
From http://spark.apache.org/docs/latest/streaming-programming-guide.html#input-dstreams-and-receivers
"For reading data from files on any file system compatible with the HDFS API (that is, HDFS, S3, NFS, etc.), a DStream can be created as: sample code
Spark Streaming will monitor the directory dataDirectory and process any files created in that directory (files written in nested directories not supported). Note that
The files must be created in the dataDirectory by atomically moving or renaming them into the data directory.
Once moved, the files must not be changed. So if the files are being continuously appended, the new data will not be read. "
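For the Java API used in your question, a minimal sketch could look like the following. It uses JavaStreamingContext.textFileStream, which is the Streaming API's built-in way to watch a directory for new text files. The directory URI is taken from your question, and the 30-second batch interval and the foreachRDD body are placeholders you would adapt to your own processing:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class HdfsDirectoryMonitor {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("MonitorHdfsFolder");
        // Poll the directory once per batch interval (30 seconds here; tune to your needs).
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(30));

        // textFileStream picks up new text files that are atomically moved
        // or renamed into the directory. The path comes from the question.
        JavaDStream<String> msg = jssc.textFileStream("hdfs://myfolder/");

        // Each batch contains only the lines of files that appeared during that interval.
        msg.foreachRDD((JavaRDD<String> rdd) -> {
            // Placeholder: replace with your actual processing logic.
            System.out.println("New lines in this batch: " + rdd.count());
        });

        jssc.start();              // begin monitoring the directory
        jssc.awaitTermination();   // block until the streaming job is stopped
    }
}

Note that, per the documentation quoted above, appending to an existing file will not be picked up; new data has to arrive as new files moved into the directory.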
Upvotes: 1