Reputation: 25
Assume I have this in pyspark:
data = [{"age":1,"count":10},{"age":2,"count":20},{"age":3,"count":30}]
rdd = sc.parallelize(data)
I want to add 10 to "count" if "age" is larger than 2, like this:
data = [{"age":1,"count":10},{"age":2,"count":20},{"age":3,"count":40}]
How to achieve this?
Upvotes: 2
Views: 3080
Reputation: 23119
You can convert the RDD to a DataFrame, which makes this much easier:
from pyspark.sql.functions import when

df = rdd.toDF()
df.withColumn("count", when(df['age'] > 2, df['count'] + 10).otherwise(df['count'])).show(truncate=False)
Output:
+---+-----+
|age|count|
+---+-----+
|1 |10 |
|2  |20   |
|3 |40 |
+---+-----+
Upvotes: 4
Reputation: 1181
There might be a better solution; this one works for me.
def add_count(x):
    x['count'] += 10  # note: this mutates the dict in place
    return x

new_data = list(map(lambda x: x if x['age'] <= 2 else add_count(x), data))
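One caveat: because `add_count` mutates each dict in place, the records in the original `data` list with "age" larger than 2 are changed as well. A non-mutating sketch (the name `bump_count` is just an illustration) builds a new dict instead:

```python
def bump_count(x):
    # Copy the record and add 10 to "count" when "age" is larger than 2;
    # the original dict is left untouched.
    if x["age"] > 2:
        return {**x, "count": x["count"] + 10}
    return x

data = [{"age": 1, "count": 10}, {"age": 2, "count": 20}, {"age": 3, "count": 30}]
new_data = list(map(bump_count, data))
print(new_data)  # [{'age': 1, 'count': 10}, {'age': 2, 'count': 20}, {'age': 3, 'count': 40}]
```

On the RDD from the question, the same function can be applied distributed-side with `rdd.map(bump_count)`.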
Upvotes: 4