Reputation: 685
Hi, I am trying to calculate the average running time per genre for the movies in this TSV data set:
running time Genre
1 Documentary,Short
5 Animation,Short
4 Animation,Comedy,Romance
Animation is one genre, and the same goes for Short, Comedy, and Romance.
I'm new to Scala and I'm confused about how to get an average for each genre using Scala without any immutable functions.
I tried the snippet below to iterate over the rows and collect the run times for each genre:
val a = list.foldLeft(Map[String,(Int)]()){
case (map,arr) =>{
map + (arr.genres.toString ->(arr.runtimeMinutes))
}}
Is there any way to calculate the average?
Upvotes: 0
Views: 130
Reputation: 285
I tried to break it down as follows. I modeled your data as a List[(Int, List[String])]:
val data: List[(Int, List[String])] = List(
(1, List("Documentary","Short")),
(5, List("Animation","Short")),
(4, List("Animation","Comedy","Romance"))
)
I wrote a function to spread the runtime value across each genre so that I have a value for each one:
val spread: ((Int, List[String]))=>List[(Int, String)] = t => t._2.map((t._1, _))
// now, if I pass it a tuple, I get:
// spread((23, List("one","two","three"))) == List((23,one), (23,two), (23,three))
So far, so good. Now I can use spread with flatMap to get a flat list of (runtime, genre) pairs:
val flatData = data.flatMap(spread)
flatData: List[(Int, String)] = List(
(1, "Documentary"),
(1, "Short"),
(5, "Animation"),
(5, "Short"),
(4, "Animation"),
(4, "Comedy"),
(4, "Romance")
)
Now we can use groupBy to summarize by genre:
flatData.groupBy(_._2)
res26: Map[String, List[(Int, String)]] = HashMap(
"Animation" -> List((5, "Animation"), (4, "Animation")),
"Documentary" -> List((1, "Documentary")),
"Comedy" -> List((4, "Comedy")),
"Romance" -> List((4, "Romance")),
"Short" -> List((1, "Short"), (5, "Short"))
)
Finally, I can get the results (it took me about 10 tries):
flatData.groupBy(_._2).map(t => (t._1, t._2.map(_._1).foldLeft(0)(_+_)/t._2.size.toDouble))
res43: Map[String, Double] = HashMap(
"Animation" -> 4.5,
"Documentary" -> 1.0,
"Comedy" -> 4.0,
"Romance" -> 4.0,
"Short" -> 3.0
)
The map() after the groupBy() is chunky, but now that I've got it, it's easy(er) to explain. Each entry in the groupBy result is (genre, List[(runtime, genre)]). So we map each entry, pull the runtimes out of its list, sum them with foldLeft, and divide by the list size to get the average. You need to coerce the calculation to a Double (here via .toDouble on the size), or you'll get integer division.
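For what it's worth, here is a sketch of the same steps with pattern matching in place of the _._1 / _._2 accessors, which I find easier to read. The object name GenreAverages and the value names are just ones I made up for illustration; the data is the sample from above. The last line shows what happens without the .toDouble.

```scala
object GenreAverages {
  // Same sample data as above: (runtime in minutes, genres).
  val data: List[(Int, List[String])] = List(
    (1, List("Documentary", "Short")),
    (5, List("Animation", "Short")),
    (4, List("Animation", "Comedy", "Romance"))
  )

  // One (runtime, genre) pair for every genre of every movie.
  val flatData: List[(Int, String)] =
    data.flatMap { case (runtime, genres) => genres.map(genre => (runtime, genre)) }

  // groupBy yields genre -> List[(runtime, genre)]; the pattern names both parts.
  val averages: Map[String, Double] =
    flatData.groupBy(_._2).map { case (genre, pairs) =>
      val runtimes = pairs.map(_._1)
      genre -> (runtimes.sum.toDouble / runtimes.size)
    }

  // Without .toDouble both operands are Int, so the result is truncated.
  val truncated: Int = (5 + 4) / 2 // 4 rather than 4.5
}
```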
I think it would have been good to define a cleaner model for the data like Luis did. That would've made all the tuple notation less obscure. Hey, I am learning, too.
Upvotes: 1
Reputation: 22850
Assuming the data was already parsed into something like:
final case class Row(runningTime: Int, genres: List[String])
Then you can follow a declarative approach to compute your desired result.
Flatten the List[Row] into a list of pairs, where the first element is a genre and the second element is a running time.
def computeAverageRunningTimePerGenre(data: List[Row]): Map[String, Double] =
data.flatMap {
case Row(runningTime, genres) =>
genres.map(genre => genre -> runningTime)
}.groupMap(_._1)(_._2).view.mapValues { runningTimes =>
runningTimes.sum.toDouble / runningTimes.size.toDouble
}.toMap
Note: There are ways to make this faster, but IMHO it is better to start with the most readable alternative and then refactor to something more performant if needed.
You can see the code running here.
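As one possible illustration of the "faster" direction mentioned in the note, here is a sketch that keeps a running (sum, count) per genre in a single foldLeft, so it never builds the intermediate list of pairs. The names SinglePassAverages and averagePerGenre are made up for this example, Row is redefined only so the block stands alone, and I have not measured whether this is actually faster on real data.

```scala
final case class Row(runningTime: Int, genres: List[String])

object SinglePassAverages {
  def averagePerGenre(data: List[Row]): Map[String, Double] = {
    // Accumulate (sum of running times, number of movies) per genre in one pass.
    val totals = data.foldLeft(Map.empty[String, (Long, Int)]) { (acc, row) =>
      row.genres.foldLeft(acc) { (inner, genre) =>
        val (sum, count) = inner.getOrElse(genre, (0L, 0))
        inner.updated(genre, (sum + row.runningTime, count + 1))
      }
    }
    // A genre only appears in totals after at least one movie, so count is never 0.
    totals.map { case (genre, (sum, count)) => genre -> (sum.toDouble / count) }
  }
}
```

This stays fully immutable, like the foldLeft attempt in the question, but the accumulator holds a sum and a count per genre instead of a single runtime.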
Upvotes: 3