Sachin Sukumaran

Reputation: 715

How to deal with a large dataset in a for loop in PySpark

I know that when we call collect() and the dataset is too large to fit in memory, Spark will crash due to memory problems. So what is the right approach in the case below?

I have an RDD fmap, and fmap is large. I want to do some processing inside a for loop; the code below works if the dataset is of average size. If the dataset is larger, what would be the best approach?

for x,(k,v) in fmap.collect():
    st = x + " " + k +  " " + str(v)
    mynewList.append(st) 

My intention is to format the data.

My RDD
[
('HOMICIDE', ('2017', 1)), 
('DECEPTIVE PRACTICE', ('2015', 10)), 
('DECEPTIVE PRACTICE', ('2014', 3)), 
('DECEPTIVE PRACTICE', ('2017', 14)), 
('ROBBERY', ('2017', 1))
]
Expected result 
=============
[
('HOMICIDE', '2017', 1), 
('DECEPTIVE PRACTICE', '2015', 10), 
('DECEPTIVE PRACTICE', '2014', 3), 
('DECEPTIVE PRACTICE', '2017', 14), 
('ROBBERY', '2017', 1)
]

Upvotes: 1

Views: 2134

Answers (1)

Alper t. Turker

Reputation: 35219

TL;DR Don't collect. If you do, and then process the data on the driver, there is no reason to use Spark. collect is useful for testing, but has negligible value otherwise.

Just use map. Python 2:

rdd.map(lambda (x, (k,v)): x + " " + k +  " " + str(v))

Python 3 (tuple parameter unpacking in lambdas was removed by PEP 3113):

rdd.map(lambda xkv: xkv[0] + " " + xkv[1][0] +  " " + str(xkv[1][1]))

Version independent:

def f(xkv):
    # unpack ('HOMICIDE', ('2017', 1)) into its three components
    (x, (k, v)) = xkv
    return x + " " + k + " " + str(v)

rdd.map(f)
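
As a quick sanity check, here is a minimal sketch of that map-based approach; it assumes an existing SparkContext named sc and reuses a couple of the sample records from the question:

sample = sc.parallelize([('HOMICIDE', ('2017', 1)), ('ROBBERY', ('2017', 1))])
# take() only brings a few records back to the driver, unlike a full collect()
print(sample.map(f).take(2))
# ['HOMICIDE 2017 1', 'ROBBERY 2017 1']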

To get tuples replace:

x + " " + k +  " " + str(v)

with:

(x, k, v)

or

(xkv[0], xkv[1][0], str(xkv[1][1]))
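
Putting it together, a minimal end-to-end sketch that reproduces the expected result (again assuming an existing SparkContext named sc; to_tuple is just an illustrative name):

fmap = sc.parallelize([
    ('HOMICIDE', ('2017', 1)),
    ('DECEPTIVE PRACTICE', ('2015', 10)),
    ('DECEPTIVE PRACTICE', ('2014', 3)),
    ('DECEPTIVE PRACTICE', ('2017', 14)),
    ('ROBBERY', ('2017', 1)),
])

def to_tuple(xkv):
    # flatten ('HOMICIDE', ('2017', 1)) into ('HOMICIDE', '2017', 1)
    (x, (k, v)) = xkv
    return (x, k, v)

# the transformation runs on the executors; take() only pulls a small
# sample back to the driver for inspection
print(fmap.map(to_tuple).take(5))
# [('HOMICIDE', '2017', 1), ('DECEPTIVE PRACTICE', '2015', 10), ...]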

Upvotes: 3
