msharky

Reputation: 129

How to combine and collect elements of an RDD into a list in pyspark

I am working with Apache Spark for Python and have created a Spark DataFrame with name, latitude, and longitude as the column names.

My DataFrame has the form:

name     latitude      longitude

M          1.3           22.5
S          1.6           22.9
H          1.7           23.4
W          1.4           23.3
C          1.1           21.2
...        ...           ....

I know that to collect only the latitude I can do

list_of_lat = df.rdd.map(lambda r: r.latitude).collect()

print list_of_lat

[1.3,1.6,1.7,1.4,1.1,...]

However, I need to collect the latitude and longitude values together in a list in the form:

[[1.3,22.5],[1.6,22.9],[1.7,23.4]...]

I have tried

lat_lon = df.rdd.map(lambda r,x : r.latitude, x.longitude).collect()

However, this does not work.

I need to use Spark because the dataset is very large (~1M rows).

Any help would be greatly appreciated. Thanks

Upvotes: 4

Views: 19912

Answers (1)

Bob Haffner

Reputation: 8483

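I'm assuming that

lat_lon = df.rdd.map(lambda r,x : r.latitude, x.longitude).collect()

gave you the error NameError: name 'x' is not defined. The comma ends the lambda, so x.longitude is evaluated as a separate second argument to map, outside the lambda, where x does not exist.

Try mapping each row to a list instead: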

lat_lon = df.rdd.map(lambda x : [x.latitude, x.longitude]).collect()
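To see why the original call fails and why the single-parameter lambda works, here is a minimal sketch that runs without Spark. It makes two assumptions that are not in the question: a namedtuple called Row stands in for a PySpark row, and Python's built-in map stands in for df.rdd.map. The parsing of the lambda is the same in both cases.

```python
from collections import namedtuple

# Hypothetical stand-in for a PySpark Row, so the sketch runs without Spark.
Row = namedtuple("Row", ["name", "latitude", "longitude"])

rows = [
    Row("M", 1.3, 22.5),
    Row("S", 1.6, 22.9),
    Row("H", 1.7, 23.4),
]

# The original attempt: the comma ends the lambda, so Python sees two
# arguments to map(): a two-parameter lambda and the expression x.longitude.
# x is not defined outside the lambda, so a NameError is raised before
# map() is even called.
try:
    list(map(lambda r, x: r.latitude, x.longitude))
    error_name = None
except NameError as exc:
    error_name = type(exc).__name__

# The fix: one parameter per row, returning both fields in a list.
lat_lon = list(map(lambda row: [row.latitude, row.longitude], rows))

print(error_name)  # NameError
print(lat_lon)     # [[1.3, 22.5], [1.6, 22.9], [1.7, 23.4]]
```

With a real DataFrame, the same one-parameter lambda goes inside df.rdd.map(...).collect(), as in the line above.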

Upvotes: 6
