anonuser0428

Reputation: 12363

Read streaming data from s3 using pyspark

I would like to leverage Python for its simple text parsing and functional programming capabilities, and to tap into its rich offering of scientific computing libraries such as numpy and scipy, so I want to use PySpark for this task.

The task I want to perform at the outset is to read from a bucket to which text files are being written as part of a stream. Could someone post a code snippet showing how to read streaming data from an S3 path using PySpark? Until recently I thought this could only be done with Scala and Java, but today I found out that streaming has been supported in PySpark since Spark 1.2. However, I am unsure whether S3 streaming is supported.

The way I used to do it in Scala was to read the data in as a Hadoop text file and use configuration parameters to set the AWS key and secret. How would I do something similar in PySpark?

Any help would be much appreciated.

Thanks in advance.

Upvotes: 2

Views: 4161

Answers (1)

Leo

Reputation: 2072

Check the "Basic Sources" section in the documentation: https://spark.apache.org/docs/latest/streaming-programming-guide.html

I believe you want something like

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Run locally with two threads
sc = SparkContext('local[2]', 'my_app')
# Batch interval of 1 second
ssc = StreamingContext(sc, 1)

# Picks up new text files as they appear under the S3 path
stream = ssc.textFileStream('s3n://...')
stream.pprint()

ssc.start()
ssc.awaitTermination()
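Since the question also asks how to set the AWS key and secret, one way is to write them into the SparkContext's Hadoop configuration before creating the stream, mirroring the `hadoopConfiguration` approach used from Scala. A minimal sketch (the helper name `configure_s3_credentials` is hypothetical; the `fs.s3n.*` property names are the standard Hadoop keys for `s3n://` paths):

```python
# Hadoop configuration keys for s3n:// credentials
S3N_KEY_ID = 'fs.s3n.awsAccessKeyId'
S3N_SECRET = 'fs.s3n.awsSecretAccessKey'

def configure_s3_credentials(sc, access_key, secret_key):
    """Set AWS credentials on an existing SparkContext so that
    textFileStream can read s3n:// paths."""
    conf = sc._jsc.hadoopConfiguration()
    conf.set(S3N_KEY_ID, access_key)
    conf.set(S3N_SECRET, secret_key)

# Usage (assumes a live SparkContext, as in the snippet above):
#   configure_s3_credentials(sc, 'YOUR_ACCESS_KEY', 'YOUR_SECRET_KEY')
#   stream = ssc.textFileStream('s3n://...')
```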

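Regarding the Python text-parsing capabilities mentioned in the question: once the DStream exists, any plain Python function can be mapped over the incoming lines. A sketch, where `parse_line` and its comma-separated field layout are purely hypothetical:

```python
def parse_line(line):
    """Split a comma-separated record into named fields."""
    fields = line.strip().split(',')
    return {'timestamp': fields[0], 'value': float(fields[1])}

# With the DStream above, this plugs into the usual transformations:
#   parsed = stream.map(parse_line)
#   parsed.pprint()
```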
Upvotes: 1
