Reputation: 63
I have been doing a count of "games" using spark-sql, in two different ways. The first way is like so:
import spark.implicits._  // needed for the $"..." column syntax
// One row per (game_version, server) pair, with the row count as "patch_games"
val gamesByVersion = dataframe.groupBy("game_version", "server").count().withColumnRenamed("count", "patch_games")
val games_count1 = gamesByVersion.where($"game_version" === 1 && $"server" === 1)
The second is like this:
import org.apache.spark.sql.functions.sum

// One row per (hero_id, position, game_version, server) combination
val gamesDf = dataframe.
  groupBy($"hero_id", $"position", $"game_version", $"server").count().
  withColumnRenamed("count", "hero_games")
val games_count2 = gamesDf.where($"game_version" === 1 && $"server" === 1).agg(sum("hero_games"))
For all intents and purposes, dataframe just has the columns hero_id, position, game_version, and server.
However, games_count1 ends up being about 10, while games_count2 ends up being 50. Obviously these two counting methods are not equivalent, or something else is going on, but I am trying to figure out: what is the reason for the difference between them?
Upvotes: 0
Views: 279
Reputation: 668
I guess it is because in the first query you group by only 2 columns, while in the second you group by 4. Therefore, you may have fewer distinct groups on just two columns.
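To make that concrete, here is a minimal, self-contained sketch. The toy data (hero IDs, positions, row counts) is entirely made up for illustration and is not taken from the question; it just shows how grouping by 2 columns versus 4 columns produces different numbers of groups over the same rows:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

val spark = SparkSession.builder().master("local[*]").appName("group-count-sketch").getOrCreate()
import spark.implicits._

// Toy data: 4 hero rows, all from game_version = 1 on server = 1.
// hero_id 10 appears twice in the same position; the others once each.
val dataframe = Seq(
  (10, "top", 1, 1),
  (11, "mid", 1, 1),
  (10, "top", 1, 1),
  (12, "bot", 1, 1)
).toDF("hero_id", "position", "game_version", "server")

// Grouping by 2 columns collapses everything into a single group:
val byVersion = dataframe.groupBy("game_version", "server").count()
byVersion.show()   // 1 group row, count = 4

// Grouping by 4 columns keeps one group per distinct combination:
val byHero = dataframe.groupBy($"hero_id", $"position", $"game_version", $"server").count()
byHero.show()      // 3 group rows, counts 2 / 1 / 1

// Summing the finer-grained counts reproduces the coarse count,
// because both aggregations ultimately count the same underlying rows:
byHero.agg(sum("count")).show()   // 4

Note that the 2-vs-4 column difference shows up in how many group rows each aggregation yields, not in the total: summing the finer counts matches the coarse count here, since both count the same underlying rows.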
Upvotes: 1