How to filter out RDDs based on multiple conditions?

Question

For illustrations purpose, I have a dataset with 3 columns (X, Y, Z). Now, I want to calculate Total z or Avg z value for a year between 2001 and 2008.

To filter out year, I know:

ps2 = ps1.filter(lambda x: int(x[0])>2001 and int(x[0])<2008)

But how to create a new column with total_z or avg_z values for each year?

Erik Maruškin · Accepted Answer

I am not sure if you just want the average value for every year, but if you do, use the simple aggregation:

 p2.groupby('X').avg('Z')

which give you a result:

+----+------+
|   X|avg(Z)|
+----+------+
|2003| 600.0|
|2002| 262.5|
+----+------+

If you need to keep Y column and duplicate the same average result like this:

+----+---+---+-----+
|   X|  Y|  Z|  avg|
+----+---+---+-----+
|2003| FL|600|600.0|
|2002| NY|300|262.5|
|2002| AZ|225|262.5|
+----+---+---+-----+

this code should help you:

    p2 = df.filter((df['X'] > 2001) & (df['X'] < 2008))
    partitioned = Window.partitionBy('X')
    result = p2.withColumn('avg', avg('Z').over(partitioned))
    result.show()

How to filter out RDDs based on multiple conditions?

Answers (1)

Related Questions