How to stream only part of a file with Apache Spark

Question

I'm trying to use Spark Streaming and Spark SQL with the Python API.

I have a file that is constantly edited: a few rows are appended to it at random intervals of N seconds.

The file can be JSON, XML, CSV, or TXT, or even a SQL table: I'm totally free to choose the best solution for my situation.

I have a small number of fields, about 4-5. Take this table as an example:

+-------+------+-------+--------------------+ 
| event |  id  | alert |      datetime      |
+-------+------+-------+--------------------+
| reg   |  1   | def1  | 06.06.17-17.24.30  |
+-------+------+-------+--------------------+
| alt   |  2   | def2  | 06.06.17-17.25.11  |
+-------+------+-------+--------------------+
| mot   |  3   | def5  | 06.06.17-17.26.01  |
+-------+------+-------+--------------------+
| mot   |  4   | def5  | 06.06.17-17.26.01  |
+-------+------+-------+--------------------+

I want Spark Streaming to stream only the new lines. So, if I add 2 new rows, the next time I want to stream only those two rows instead of the entire file (which has already been streamed).

Moreover, I want to filter or run a Spark SQL query on the same entire file each time a new row is found. For example, I want to select the event "mot" only if it appears two times within 10 minutes, and this query must be rerun each time the file changes and new data arrives.

Can Spark Streaming and Spark SQL handle these situations? And how?

user10004552 · Accepted Answer

This is not supported for file sources. The Spark documentation describes the file source as follows:

Reads files written in a directory as a stream of data. Supported file formats are text, csv, json, orc, parquet. See the docs of the DataStreamReader interface for a more up-to-date list, and supported options for each file format. Note that the files must be atomically placed in the given directory, which in most file systems, can be achieved by file move operations

The same applies to legacy streaming (DStreams). Note that the quote below comes from the 2.2 documentation, but the implementation hasn't changed:

The files must be created in the dataDirectory by atomically moving or renaming them into the data directory.
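
In practice, this means that instead of appending rows to one file, each batch of new rows can be written as its own file and then moved into the watched directory. Here is a minimal sketch of that producer side, using only the Python standard library. The function name publish_rows and the staging_dir/data_dir layout are my own assumptions for illustration, not part of any Spark API, and the rename is only atomic when both directories are on the same filesystem.

```python
import csv
import os
import tempfile
import uuid


def publish_rows(rows, data_dir, staging_dir):
    """Write `rows` to a new CSV file and move it into `data_dir` in one step.

    Hypothetical helper: `data_dir` is the directory a streaming reader
    watches, `staging_dir` is a scratch directory on the same filesystem.
    """
    os.makedirs(data_dir, exist_ok=True)
    os.makedirs(staging_dir, exist_ok=True)

    # Write the batch outside the watched directory, so a reader can
    # never pick up a partially written file.
    fd, tmp_path = tempfile.mkstemp(suffix=".csv", dir=staging_dir)
    with os.fdopen(fd, "w", newline="") as handle:
        csv.writer(handle).writerows(rows)

    # os.replace performs a rename; on POSIX systems this is atomic when
    # source and destination are on the same filesystem.
    final_path = os.path.join(data_dir, "batch-" + uuid.uuid4().hex + ".csv")
    os.replace(tmp_path, final_path)
    return final_path
```

Called once per group of new rows, e.g. publish_rows([["mot", "4", "def5", "06.06.17-17.26.01"]], "data", "staging"), this leaves one complete file per batch in the data directory, which is the layout the quoted documentation asks for.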
