anonuser0428

Reputation: 12363

Read streaming data from s3 using pyspark

I would like to leverage Python for its simple text parsing and functional programming capabilities, and to tap into its rich offering of scientific computing libraries such as numpy and scipy, so I want to use PySpark for this task.

The task I want to perform at the outset is to read from a bucket to which text files are being written as part of a stream. Could someone post a code snippet showing how to read streaming data from an S3 path using PySpark? Until recently I thought this could only be done with Scala and Java, but today I found out that streaming has been supported in PySpark since Spark 1.2. However, I am unsure whether S3 streaming is supported.

The way I used to do it in Scala was to read the data in as a Hadoop text file and use configuration parameters to set the AWS key and secret. How would I do something similar in PySpark?

Any help would be much appreciated.

Thanks in advance.

Upvotes: 2

Views: 4161

Answers (1)

Leo

Reputation: 2072

Check the "Basic Sources" section in the documentation: https://spark.apache.org/docs/latest/streaming-programming-guide.html

I believe you want something like

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Run locally with two threads
sc = SparkContext('local[2]', 'my_app')
# Batch interval of 1 second
ssc = StreamingContext(sc, 1)

# Picks up new text files as they appear under the S3 path
stream = ssc.textFileStream('s3n://...')
stream.pprint()

ssc.start()
ssc.awaitTermination()
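Since the question also asks how to set the AWS key and secret, one way is to write them into the SparkContext's Hadoop configuration before creating the stream, mirroring the `hadoopConfiguration` approach used from Scala. A minimal sketch (the helper name `configure_s3_credentials` is hypothetical; the `fs.s3n.*` property names are the standard Hadoop keys for `s3n://` paths):

```python
# Hadoop configuration keys for s3n:// credentials
S3N_KEY_ID = 'fs.s3n.awsAccessKeyId'
S3N_SECRET = 'fs.s3n.awsSecretAccessKey'

def configure_s3_credentials(sc, access_key, secret_key):
    """Set AWS credentials on an existing SparkContext so that
    textFileStream can read s3n:// paths."""
    conf = sc._jsc.hadoopConfiguration()
    conf.set(S3N_KEY_ID, access_key)
    conf.set(S3N_SECRET, secret_key)

# Usage (assumes a live SparkContext, as in the snippet above):
#   configure_s3_credentials(sc, 'YOUR_ACCESS_KEY', 'YOUR_SECRET_KEY')
#   stream = ssc.textFileStream('s3n://...')
```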

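Regarding the Python text-parsing capabilities mentioned in the question: once the DStream exists, any plain Python function can be mapped over the incoming lines. A sketch, where `parse_line` and its comma-separated field layout are purely hypothetical:

```python
def parse_line(line):
    """Split a comma-separated record into named fields."""
    fields = line.strip().split(',')
    return {'timestamp': fields[0], 'value': float(fields[1])}

# With the DStream above, this plugs into the usual transformations:
#   parsed = stream.map(parse_line)
#   parsed.pprint()
```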
Upvotes: 1
