muni

Reputation: 1403

Getting the length of each list within an RDD object

I have an RDD object like:

x=[[1,2,3],[4,5,6,7],[7,2,6,9,10]..]

I want to get a list out of it that contains the length of each inner list:

y=[3,4,5..]

Where 3=len([1,2,3]), 4=len([4,5,6,7]), 5=len([7,2,6,9,10])...

This syntax works for a plain Python list:

[len(y) for y in yourlist]

How do I do the same iteration over an RDD?

Upvotes: 1

Views: 4063

Answers (2)

TMichel

Reputation: 4442

Create a DataFrame from your RDD; then you can use the size() SQL function from pyspark.sql.functions.

from pyspark.sql.functions import size

df = spark.createDataFrame([([1, 2, 3],),([4,5,6,7],),([7,2,6,9,10],)], ['data'])
df.select(size(df.data)).collect()
# [Row(size(data)=3), Row(size(data)=4), Row(size(data)=5)]
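The collect() call above returns Row objects rather than the plain [3, 4, 5] list asked for in the question. A minimal sketch of unwrapping them, assuming each row has exactly one column; first_column is a hypothetical helper name, not part of PySpark:

```python
def first_column(rows):
    """Unwrap single-column rows into a plain Python list of values."""
    # pyspark.sql.Row is a tuple subclass, so positional indexing works
    # without knowing the generated column name (here "size(data)").
    return [row[0] for row in rows]
```

It could be used as first_column(df.select(size(df.data)).collect()). Indexing by position avoids depending on the auto-generated column name.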

Update

You can create a DataFrame from your original RDD like this:

from pyspark.sql import Row

rowrdd = rdd.map(lambda x: Row(data=x))
df = spark.createDataFrame(rowrdd)
...

Upvotes: 0

desertnaut

Reputation: 60318

You just need to perform a map operation on your RDD:

x = [[1,2,3], [4,5,6,7], [7,2,6,9,10]]
rdd = sc.parallelize(x)
rdd_length = rdd.map(lambda x: len(x))
rdd_length.collect()
# [3, 4, 5]
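Since lambda x: len(x) only forwards its argument, len can also be passed to map directly. A minimal sketch of that variant, wrapped in a hypothetical helper named list_lengths (not part of PySpark); demo() assumes pyspark is installed and that running Spark in local mode is acceptable:

```python
def list_lengths(rdd):
    """Return the length of each element of `rdd` as a Python list.

    `rdd` is expected to be a PySpark RDD whose elements support len().
    """
    # map(len) is equivalent to map(lambda x: len(x)).
    # collect() pulls every result to the driver, so this is only
    # appropriate when the list of lengths fits in driver memory.
    return rdd.map(len).collect()


def demo():
    # Assumption: pyspark is installed; "local[1]" runs Spark on the
    # local machine with a single worker thread.
    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setMaster("local[1]").setAppName("list-lengths")
    sc = SparkContext.getOrCreate(conf)
    try:
        rdd = sc.parallelize([[1, 2, 3], [4, 5, 6, 7], [7, 2, 6, 9, 10]])
        print(list_lengths(rdd))
    finally:
        # Stops the context, including one that existed before demo() ran.
        sc.stop()
```

Call demo() to try it against a local Spark instance. For a very large RDD, it may be better to keep rdd.map(len) as a distributed RDD and skip collect().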

Upvotes: 3
