Vivek Singh
Vivek Singh

Reputation: 685

Scala getting average of the result

Hi I am trying to calculate the average of the movie result from this tsv set

running time        Genre
    1               Documentary,Short
    5               Animation,Short
    4               Animation,Comedy,Romance

Animation is one type of Genre and same goes for Short, Comedy, Romance

I'm new to Scala and I'm confused about how to get an Average as per each genre using Scala without any immutable functions

I tried using this below snippet to just try some sort of iterations and get the runTimes as per each genre

val a = list.foldLeft(Map[String,(Int)]()){
      case (map,arr) =>{
      map + (arr.genres.toString ->(arr.runtimeMinutes))
    }}

Is there any way to calculate the average

Upvotes: 0

Views: 130

Answers (2)

hughj
hughj

Reputation: 285

I tried to break it down as follows: I modeled your data as a List[(Int, String)]:

  val data: List[(Int, List[String])] = List(
    (1, List("Documentary","Short")),
    (5, List("Animation","Short")),
    (4, List("Animation","Comedy","Romance"))
  )

I wrote a function to spread the runtime value across each genre so that I have a value for each one:

val spread: ((Int, List[String]))=>List[(Int, String)] = t => t._2.map((t._1, _))
  // now, if I pass it a tuple, I get:
  //    spread((23, List("one","two","three")))) == List((23,one), (23,two), (23,three))

So far, so good. Now I can use spread with flatMap to get a 2-dimensional list:

val flatData = data.flatMap(spread)
flatData: List[(Int, String)] = List(
  (1, "Documentary"),
  (1, "Short"),
  (5, "Animation"),
  (5, "Short"),
  (4, "Animation"),
  (4, "Comedy"),
  (4, "Romance")
)

Now we can use groupBy to summarize by genre:

flatData.groupBy(_._2)
res26: Map[String, List[(Int, String)]] = HashMap(
  "Animation" -> List((5, "Animation"), (4, "Animation")),
  "Documentary" -> List((1, "Documentary")),
  "Comedy" -> List((4, "Comedy")),
  "Romance" -> List((4, "Romance")),
  "Short" -> List((1, "Short"), (5, "Short"))
)

Finally, I can get the results (it took me about 10 tries):

flatData.groupBy(_._2).map(t => (t._1, t._2.map(_._1).foldLeft(0)(_+_)/t._2.size.toDouble))
res43: Map[String, Double] = HashMap(
  "Animation" -> 4.5,
  "Documentary" -> 1.0,
  "Comedy" -> 4.0,
  "Romance" -> 4.0,
  "Short" -> 3.0
)

The map() after the groupBy() is chunky, but now that I got it, it's easy(er) to explain. Each tuple in the groupBy is (rating, list(genre)). So we just map each tuple and use foldLeft to compute the average of the values in each list. You should coerce the calc to a double, or you'll get integer division.

I think it would have been good to define a cleaner model for the data like Luis did. That would've made all the tuple notation less obscure. Hey, I am learning, too.

Upvotes: 1

Assuming the data was already parsed into something like:

final case class Row(runningTime: Int, genres: List[String])

Then you can follow a declarative approach to compute your desired result.

  1. Flatten a List[Row] into a list of pairs, where the first element is a genre and the second element is a running time.
  2. Collect all running times for the same genre.
  3. Reduce each group to compute its average.
def computeAverageRunningTimePerGenre(data: List[Row]): Map[String, Double] =
  data.flatMap {
    case Row(runningTime, genres) =>
      genres.map(genre => genre -> runningTime)
  }.groupMap(_._1)(_._2).view.mapValues { runningTimes =>
    runningTimes.sum.toDouble / runningTimes.size.toDouble
  }.toMap

Note: There are ways to make this faster but IMHO is better to start with the most readable alternative first and then refactor to something more performant if needed.


You can see the code running here.

Upvotes: 3

Related Questions