Reputation: 25
I am trying to make my year column into year range instead of the value being a specific year. This is a movie dataset.
This is my code:
group =join_DF.groupby("relYear").avg("rating").withColumnRenamed("relYear", "year_range")
group.show()
This is what i have right now:
+----------+------------------+
|year_range| avg(rating)|
+----------+------------------+
| 1953|3.7107686857952533|
| 1903|3.0517241379310347|
| 1957|3.9994918537809254|
| 1897|2.9177215189873418|
| 1987|3.5399940908663594|
| 1956|3.7077949616896153|
| 2016|3.5318961695914055|
| 1936|3.8356813313560724|
| 2012|3.5490157995509457|
| |3.5151401495104130|
+----------+------------------+
This is what i want to achieve:
+-----------------+------------------+
| year_range | avg(rating) |
+-----------------+------------------+
| 1970-1979 |3.7773614199240319|
| 1960-1969 |3.8007319471419123|
| |3.5455419410410923|
| 1980-1989 |3.5778570247142313|
| 2000 onwards |3.5009940908663594|
| 1959 and earlier|3.8677949616896153|
| 1990-1999 |3.4618961695914055|
+-----------------+------------------+
The year_range that is null are movie title without release year stated. 1874 is earliest year and 2019 is the latest year.
Upvotes: 0
Views: 127
Reputation: 9308
You can divide the year by 10 to make your bin.
df = (df.withColumn('bin', F.floor(F.col('year_range') / 10))
.withColumn('bin', F.when(F.col('bin') >= 200, 200)
.when(F.col('bin') <= 195, 195)
.otherwise(F.col('bin')))
.groupby('bin')
.agg(F.avg('avg(rating)').alias('10_years_avg')))
Then, format your year_range
column.
df = df.withColumn('year_range', F.when(F.col('bin') >= 200, F.lit('2000 onwards'))
.when(F.col('bin') <= 195, F.lit('1959 and earlier'))
.otherwise(F.concat(F.col('bin'), F.lit('0-'), F.col('bin'), F.lit('9'))))
Upvotes: 1