I know that when we call collect() and the data set is too large to fit in memory, Spark will crash due to memory problems. So what is the right approach in the case below?
I have an RDD fmap, and fmap is large. I want to do some processing inside a for loop; the code below works if the data set is of average size. What is the best approach if the data set is larger?
mynewList = []
for x, (k, v) in fmap.collect():
    st = x + " " + k + " " + str(v)
    mynewList.append(st)
My intention is to format the data.
My RDD
[
('HOMICIDE', ('2017', 1)),
('DECEPTIVE PRACTICE', ('2015', 10)),
('DECEPTIVE PRACTICE', ('2014', 3)),
('DECEPTIVE PRACTICE', ('2017', 14)),
('ROBBERY', ('2017', 1))
]
Expected result
=============
[
('HOMICIDE', '2017', 1),
('DECEPTIVE PRACTICE', '2015', 10),
('DECEPTIVE PRACTICE', '2014', 3),
('DECEPTIVE PRACTICE', '2017', 14),
('ROBBERY', '2017', 1)
]
Upvotes: 1
TL;DR Don't collect. If you do, and then process the data on the driver, there is no reason to use Spark. collect is useful for testing, but otherwise has negligible value.
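If you only need to inspect a few records on the driver, take is a memory-safe alternative to collect (a minimal sketch, assuming the fmap RDD from the question):

fmap.take(5)  # brings at most 5 records to the driver instead of the whole data set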
Just use map. Python 2:
rdd.map(lambda (x, (k,v)): x + " " + k + " " + str(v))
Python 3:
rdd.map(lambda xkv: xkv[0] + " " + xkv[1][0] + " " + str(xkv[1][1]))
Version independent:
def f(xkv):
    (x, (k, v)) = xkv
    return x + " " + k + " " + str(v)

rdd.map(f)
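For illustration, here is a minimal end-to-end sketch, assuming an existing SparkContext named sc and the sample data from the question:

data = [
    ('HOMICIDE', ('2017', 1)),
    ('DECEPTIVE PRACTICE', ('2015', 10)),
    ('DECEPTIVE PRACTICE', ('2014', 3)),
    ('DECEPTIVE PRACTICE', ('2017', 14)),
    ('ROBBERY', ('2017', 1))
]
rdd = sc.parallelize(data)  # distribute the sample data as an RDD

formatted = rdd.map(f)      # apply the version-independent function above
print(formatted.take(2))    # inspect a small sample on the driver
# ['HOMICIDE 2017 1', 'DECEPTIVE PRACTICE 2015 10']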
To get tuples instead, replace:
x + " " + k + " " + str(v)
with:
(x, k, v)
or
(xkv[0], xkv[1][0], xkv[1][1])
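For example, this reproduces the expected result from the question (again assuming the rdd defined in the sketch above):

tuples = rdd.map(lambda xkv: (xkv[0], xkv[1][0], xkv[1][1]))
print(tuples.take(2))
# [('HOMICIDE', '2017', 1), ('DECEPTIVE PRACTICE', '2015', 10)]

If the full result has to leave Spark, write it out with saveAsTextFile instead of collecting it on the driver.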
Upvotes: 3