Reputation: 63
I have been doing a count of "games" using spark-sql, in two different ways. The first way is like so:
import spark.implicits._  // needed for the $"..." column syntax
// One row per (game_version, server) pair, with the row count as "patch_games"
val gamesByVersion = dataframe.groupBy("game_version", "server").count().withColumnRenamed("count", "patch_games")
val games_count1 = gamesByVersion.where($"game_version" === 1 && $"server" === 1)
The second is like this:
import org.apache.spark.sql.functions.sum

// One row per (hero_id, position, game_version, server) combination
val gamesDf = dataframe.
  groupBy($"hero_id", $"position", $"game_version", $"server").count().
  withColumnRenamed("count", "hero_games")
val games_count2 = gamesDf.where($"game_version" === 1 && $"server" === 1).agg(sum("hero_games"))
For all intents and purposes, dataframe just has the columns hero_id, position, game_version, and server.
However, games_count1 ends up being about 10, while games_count2 ends up being 50. Obviously these two counting methods are not equivalent, or something else is going on, but I am trying to figure out: what is the reason for the difference between them?
Upvotes: 0
Views: 279
Reputation: 668
I guess it is because in the first query you group by only 2 columns, while in the second you group by 4. Therefore, you may have fewer distinct groups on just two columns.
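To make that concrete, here is a minimal, self-contained sketch. The toy data (hero IDs, positions, row counts) is entirely made up for illustration and is not taken from the question; it just shows how grouping by 2 columns versus 4 columns produces different numbers of groups over the same rows:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

val spark = SparkSession.builder().master("local[*]").appName("group-count-sketch").getOrCreate()
import spark.implicits._

// Toy data: 4 hero rows, all from game_version = 1 on server = 1.
// hero_id 10 appears twice in the same position; the others once each.
val dataframe = Seq(
  (10, "top", 1, 1),
  (11, "mid", 1, 1),
  (10, "top", 1, 1),
  (12, "bot", 1, 1)
).toDF("hero_id", "position", "game_version", "server")

// Grouping by 2 columns collapses everything into a single group:
val byVersion = dataframe.groupBy("game_version", "server").count()
byVersion.show()   // 1 group row, count = 4

// Grouping by 4 columns keeps one group per distinct combination:
val byHero = dataframe.groupBy($"hero_id", $"position", $"game_version", $"server").count()
byHero.show()      // 3 group rows, counts 2 / 1 / 1

// Summing the finer-grained counts reproduces the coarse count,
// because both aggregations ultimately count the same underlying rows:
byHero.agg(sum("count")).show()   // 4

Note that the 2-vs-4 column difference shows up in how many group rows each aggregation yields, not in the total: summing the finer counts matches the coarse count here, since both count the same underlying rows.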
Upvotes: 1