Seema Mudgil

Reputation: 385

Retrieving data from an S3 bucket in PySpark

I am reading data from an S3 bucket in PySpark. I need to parallelize the read operation and do some transformations on the data, but it is throwing an error. Below is the code.

import json
import boto3

s3 = boto3.resource('s3', aws_access_key_id=access_key, aws_secret_access_key=secret_key)
bucket = s3.Bucket(bucket)

prefix = 'clickEvent-2017-10-09'
files = bucket.objects.filter(Prefix=prefix)
keys = [k.key for k in files]
pkeys = sc.parallelize(keys)

I have a global variable d, which is an empty list, and I am appending the deviceID data into it.

Applying flatMap on the keys:

pkeys.flatMap(map_func)

This is the function:

def map_func(key):
    print "in map func"
    for line in key.get_contents_as_string().splitlines():
        # parse one line of json
        content = json.loads(line)
        d.append(content['deviceID'])

But the above code gives me an error. Can anyone help?

Upvotes: 1

Views: 6037

Answers (1)

Ryan Widmaier

Reputation: 8523

You have two issues that I can see. The first is that you are trying to manually read data from S3 using boto instead of using the direct S3 support built into Spark and Hadoop. It looks like you are trying to read text files containing one JSON record per line. If that is the case, you can just do this in Spark:

df = spark.read.json('s3://my-bucket/path/to/json/files/')

This will create a Spark DataFrame for you by reading in the JSON data with each line as a row. DataFrames require a rigid pre-defined schema (like a relational database table), which Spark will determine by sampling some of your JSON data. After you have the DataFrame, all you need to do to get your column is select it like this:

df.select('deviceID')
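
If the inferred schema is not what you expect, you can inspect it, or supply one yourself and skip the sampling pass. A minimal sketch, assuming each record carries a string deviceID field (the field name comes from the question; everything else is illustrative):

from pyspark.sql.types import StructType, StructField, StringType

# See what Spark inferred by sampling the JSON:
df.printSchema()

# Or declare the schema up front so no sampling pass is needed:
schema = StructType([StructField('deviceID', StringType(), True)])
df = spark.read.json('s3://my-bucket/path/to/json/files/', schema=schema)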

The other issue worth pointing out is that you are attempting to use a global variable to store data computed across your Spark cluster. It is possible to send data from your driver to all of the executors running on Spark workers using either broadcast variables or implicit closures. But there is no way in Spark to write to a variable in your driver from an executor! To transfer data from executors back to the driver, you need to use Spark's action methods, which are intended for exactly this purpose.
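
To make the driver/executor distinction concrete, here is a small sketch (the names are illustrative, not from the question): appends made on executors never reach the driver's copy of the list, while a broadcast variable happily ships read-only data the other way:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

d = []                                                     # lives on the driver
sc.parallelize([1, 2, 3]).foreach(lambda x: d.append(x))   # each executor appends to its own copy of d
print(d)                                                   # still [] on the driver

lookup = sc.broadcast({'deviceA', 'deviceB'})              # driver -> executors works, but it is read-only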

Actions are methods that tell Spark you want a result computed, so it needs to go execute the transformations you have told it about. In your case you would probably want one of the following (both sketched after the list):

If the results are large: use DataFrame.write to save the results of your transformations back to S3

If the results are small: use DataFrame.collect() to download them back to your driver and do something with them
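
A rough sketch of both options (the output path is hypothetical, pick your own):

# Large results: write them back to S3
df.select('deviceID').write.mode('overwrite').parquet('s3://my-bucket/output/device-ids/')

# Small results: collect them into the driver as plain Python objects
device_ids = [row['deviceID'] for row in df.select('deviceID').collect()]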

Upvotes: 1
