Reputation: 129
I am working with Apache Spark for Python and have created a Spark DataFrame with name, latitude, and longitude as the column names.
My DataFrame is of the form:
name latitude longitude
M 1.3 22.5
S 1.6 22.9
H 1.7 23.4
W 1.4 23.3
C 1.1 21.2
... ... ....
I know that to collect only the latitudes I can do
list_of_lat = df.rdd.map(lambda r: r.latitude).collect()
print(list_of_lat)
[1.3,1.6,1.7,1.4,1.1,...]
However, I need to collect the latitude and longitude values together in a list in the form:
[[1.3,22.5],[1.6,22.9],[1.7,23.4]...]
I have tried
lat_lon = df.rdd.map(lambda r,x : r.latitude, x.longitude).collect()
However, this does not work.
I need to use Spark since this is a very large dataset (~1M rows).
Any help would be greatly appreciated. Thanks
Upvotes: 4
Views: 19912
Reputation: 8483
I'm assuming
lat_lon = df.rdd.map(lambda r,x : r.latitude, x.longitude).collect()
gave you the following error:
NameError: name 'x' is not defined
That happens because, inside the call to map, the comma ends the lambda body, so x.longitude is passed to map as a second argument and evaluated immediately, where x has never been defined.
try
lat_lon = df.rdd.map(lambda x : [x.latitude, x.longitude]).collect()
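In case a fuller picture helps, here is a minimal, self-contained sketch of the same idea. The toy DataFrame and its values are made up for illustration, and it assumes Spark 2.x+ where SparkSession is available (on older versions you would build the DataFrame through SQLContext instead):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# toy stand-in for your DataFrame; the rows are illustrative only
df = spark.createDataFrame(
    [("M", 1.3, 22.5), ("S", 1.6, 22.9), ("H", 1.7, 23.4)],
    ["name", "latitude", "longitude"],
)

# map each Row to a [latitude, longitude] pair and bring the result to the driver
lat_lon = df.rdd.map(lambda x: [x.latitude, x.longitude]).collect()
print(lat_lon)  # [[1.3, 22.5], [1.6, 22.9], [1.7, 23.4]]

# roughly equivalent, staying in the DataFrame API
lat_lon2 = [list(row) for row in df.select("latitude", "longitude").collect()]

Collecting ~1M small pairs to the driver is usually fine, but if the pairs are only consumed by further Spark operations, it is cheaper to keep them as an RDD/DataFrame and skip the collect().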
Upvotes: 6