Spark Streaming: How to get the filename of a processed file in Python

Question

I'm sort of a noob to Spark (and also Python honestly) so please forgive me if I've missed something obvious.

I am doing file streaming with Spark and Python. In the first example I did, Spark correctly listens to the given directory and counts word occurrences in the file, so I know that everything works in terms of listening to the directory.

Now I am trying to get the name of the file that is processed for auditing purposes. I read here http://mail-archives.us.apache.org/mod_mbox/spark-user/201504.mbox/%3CCANvfmP8OC9jrpVgWsRWfqjMxeYd6sE6EojfdyFy_GaJ3BO43_A@mail.gmail.com%3E that this is no trivial task. I got a possible solution here http://mail-archives.us.apache.org/mod_mbox/spark-user/201502.mbox/%3CCAEgyCiZbnrd6Y_aG0cBRCVC1u37X8FERSEcHB=tR3A2VGrGrPQ@mail.gmail.com%3E and I have tried implementing it as follows:

from __future__ import print_function

import sys

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

def fileName(data):
    string = data.toDebugString

if __name__ == "__main__":
    sc = SparkContext(appName="PythonStreamingFileNamePrinter")
    ssc = StreamingContext(sc, 1)
    lines = ssc.textFileStream("file:///test/input/")
    files = lines.foreachRDD(fileName)
    print(files)
    ssc.start()
    ssc.awaitTermination()

Unfortunately, now rather than listening at the folder every second, it listens once, outputs 'None' and then just waits doing nothing. The only difference between this and the code that did work is the

files = lines.foreachRDD(fileName)

Before I even worry about getting the filename (tomorrow's problems) can anybody see why this is only checking the directory once?

Thanks in advance M

swinefish · Accepted Answer

So it was a noob error. I'm posting my solution for reference for myself and others.

As pointed out by @user3689574, I was not returning the debug string in my function. This fully explains why I was getting the 'None'.

Next, I was printing the debug outside of the function, meaning it was never part of the foreachRDD. Moving it into the function as follows:

def fileName(data):
    debug = data.toDebugString()
    print(debug)

This prints the debug information as it should, and continues to listen to the directory, as it should. Changing that fixed my initial problem. In terms of getting the file name, that has become pretty straightforward.

The debug string when there is no change in the directory is as follows:

(0) MapPartitionsRDD[1] at textFileStream at NativeMethodAccessorImpl.java:-2 [] | UnionRDD[0] at textFileStream at NativeMethodAccessorImpl.java:-2 []

Which neatly indicates that there is no file. When a file is copied into the directory, the debug output is as follows:

(1) MapPartitionsRDD[42] at textFileStream at NativeMethodAccessorImpl.java:-2 [] | UnionRDD[41] at testFileStream at NativeMethodAccessorImpl.java:-2 [] | file:/test/input/test.txt New HadoopRDD[40] at textFileStream at NativeMethodAccessorImpl.java:-2 []

Which, with a quick regex, gives you the file name with little trouble. Hope this helps somebody else.

Spark Streaming: How to get the filename of a processed file in Python

Answers (2)

Related Questions