johnnydonna

Reputation: 63

Difference between these two count methods in Spark

I have been doing a count of "games" using spark-sql. The first way is like so:

val gamesByVersion = dataframe.groupBy("game_version", "server").count().withColumnRenamed("count", "patch_games")

val games_count1 = gamesByVersion.where($"game_version" === 1 && $"server" === 1)

The second is like this:

val gamesDf = dataframe.
  groupBy($"hero_id", $"position", $"game_version", $"server").count().
  withColumnRenamed("count", "hero_games")

val games_count2 = gamesDf.where($"game_version" === 1 && $"server" === 1).agg(sum("hero_games"))

For all intents and purposes dataframe just has the columns hero_id, position, game_version and server.

However, games_count1 ends up being about 10, while games_count2 ends up being 50. Clearly these two counting methods are not equivalent, or something else is going on. What is the reason for the difference between them?

Upvotes: 0

Views: 279

Answers (1)

Tomasz Krol

Reputation: 668

I guess it's because in the first query you group by only two columns, while in the second you group by four. With only two grouping columns you may end up with fewer distinct groups.
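To illustrate the point, here is a toy simulation of the two groupings (pure Python standing in for Spark's groupBy/count, with made-up rows, not the asker's data). Grouping by four columns produces more, finer groups than grouping by two, so the grouped results have different numbers of rows; but summing the fine-grained counts for a given (game_version, server) recovers the coarse count for that pair:

```python
from collections import Counter

# Hypothetical rows with the schema (hero_id, position, game_version, server)
rows = [
    (1, "mid", 1, 1),
    (2, "top", 1, 1),
    (1, "mid", 1, 1),
    (3, "bot", 1, 2),
]

# Analogue of groupBy("game_version", "server").count() -> "patch_games"
patch_games = Counter((gv, srv) for (_, _, gv, srv) in rows)

# Analogue of groupBy("hero_id", "position", "game_version", "server").count()
# -> "hero_games"
hero_games = Counter(rows)

# Filter to game_version == 1 and server == 1
count1 = patch_games[(1, 1)]
count2 = sum(c for (h, p, gv, srv), c in hero_games.items()
             if gv == 1 and srv == 1)

# The coarse count equals the sum of the fine-grained counts...
print(count1, count2)  # 3 3

# ...but the number of distinct groups differs: one coarse group vs two finer ones
finer = [k for k in hero_games if k[2] == 1 and k[3] == 1]
print(len(finer))  # 2
```

So on the same input the two methods agree once the finer counts are summed; if the numbers still disagree on real data, it is worth checking what exactly is being compared (a count value vs the number of rows in the grouped result).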

Upvotes: 1
