pang sai
pang sai

Reputation: 25

Group column by year range pyspark dataframe

I am trying to make my year column into year range instead of the value being a specific year. This is a movie dataset.

This is my code:

group =join_DF.groupby("relYear").avg("rating").withColumnRenamed("relYear", "year_range") 
group.show()

This is what i have right now:

+----------+------------------+
|year_range|       avg(rating)|
+----------+------------------+
|      1953|3.7107686857952533|
|      1903|3.0517241379310347|
|      1957|3.9994918537809254|
|      1897|2.9177215189873418|
|      1987|3.5399940908663594|
|      1956|3.7077949616896153|
|      2016|3.5318961695914055|
|      1936|3.8356813313560724|
|      2012|3.5490157995509457|
|          |3.5151401495104130|
+----------+------------------+

This is what i want to achieve:

+-----------------+------------------+
|  year_range     |   avg(rating)    |
+-----------------+------------------+
| 1970-1979       |3.7773614199240319|
| 1960-1969       |3.8007319471419123|
|                 |3.5455419410410923|
| 1980-1989       |3.5778570247142313|
| 2000 onwards    |3.5009940908663594|
| 1959 and earlier|3.8677949616896153|
| 1990-1999       |3.4618961695914055|
+-----------------+------------------+

The year_range that is null are movie title without release year stated. 1874 is earliest year and 2019 is the latest year.

Upvotes: 0

Views: 127

Answers (1)

Emma
Emma

Reputation: 9308

You can divide the year by 10 to make your bin.

df = (df.withColumn('bin', F.floor(F.col('year_range') / 10))
        .withColumn('bin', F.when(F.col('bin') >= 200, 200)
                            .when(F.col('bin') <= 195, 195)
                            .otherwise(F.col('bin')))
      .groupby('bin')
      .agg(F.avg('avg(rating)').alias('10_years_avg')))

Then, format your year_range column.

df = df.withColumn('year_range', F.when(F.col('bin') >= 200, F.lit('2000 onwards'))
                                  .when(F.col('bin') <= 195, F.lit('1959 and earlier'))
                                  .otherwise(F.concat(F.col('bin'), F.lit('0-'), F.col('bin'), F.lit('9'))))

Upvotes: 1

Related Questions