dannylin

Reputation: 25

Apply function over Spark dataset

Assume I have this in pyspark:

data = [{"age":1,"count":10},{"age":2,"count":20},{"age":3,"count":30}]

rdd = sc.parallelize(data)

I want to add 10 to "count" if "age" is larger than 2, like this:

data = [{"age":1,"count":10},{"age":2,"count":20},{"age":3,"count":40}]

How can I achieve this?

Upvotes: 2

Views: 3080

Answers (2)

koiralo

Reputation: 23119

You can convert the RDD to a DataFrame, which makes this much easier:

from pyspark.sql.functions import when

df = rdd.toDF()

df.withColumn("count", when(df['age'] > 2, df['count'] + 10).otherwise(df['count'])).show(truncate=False)

Output:

+---+-----+
|age|count|
+---+-----+
|1  |10   |
|2  |20   |
|3  |40   |
+---+-----+
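
If you need the result back in the question's original shape, an RDD of plain dicts, a minimal sketch building on the df above (variable names df2 and rdd2 are just illustrative) would be:

df2 = df.withColumn("count", when(df['age'] > 2, df['count'] + 10).otherwise(df['count']))
rdd2 = df2.rdd.map(lambda row: row.asDict())  # convert each Row back to a dict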

Upvotes: 4

mursalin

Reputation: 1181

There might be a better solution, but this one works for me.

# Note: this maps over the plain Python list `data`, not the RDD.
def add_count(x):
    x['count'] += 10  # mutate the dict in place, then return it
    return x

new_data = list(map(lambda x: x if x['age'] <= 2 else add_count(x), data))
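
Since the question starts from an RDD, the same conditional can also be applied directly with rdd.map. A minimal sketch, avoiding in-place mutation by building a new dict for the changed rows:

new_rdd = rdd.map(lambda x: x if x['age'] <= 2 else {**x, 'count': x['count'] + 10})
print(new_rdd.collect())  # expected: [{'age': 1, 'count': 10}, {'age': 2, 'count': 20}, {'age': 3, 'count': 40}]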

Upvotes: 4
