cool dude

Reputation: 27

Spark groupBy agg not working as expected

I am running into a similar issue:

import org.apache.spark.sql.functions.last

(df
    .groupBy("email")
    .agg(last("user_id") as "user_id")
    .select("user_id").count,
df
    .groupBy("email")
    .agg(last("user_id") as "user_id")
    .select("user_id")
    .distinct
    .count)

When run on one machine it gives: (15123144,15123144)

When run on a cluster it gives: (15123144,24)

The first result is what I expect and looks correct, but the second one is horribly wrong. One more observation: even if I change the data so that the total count is more or less than 15123144, I still get distinct = 24 on the cluster. Even if I swap user_id and email, it gives the same distinct count.

I am even more confused after reading: https://spark.apache.org/docs/1.5.2/api/scala/index.html#org.apache.spark.sql.DataFrame

The agg doc says: "Aggregates on the entire DataFrame without groups." "Without groups"? What does that mean?

Any clue? Or an existing JIRA ticket? Or is there a workaround for now?

Upvotes: 1

Views: 1656

Answers (1)

zero323

Reputation: 330413

Let's start with the "without groups" part. As described in the docs:

df.agg(...) is a shorthand for df.groupBy().agg(...)

If that is still not clear, it translates to the following SQL:

SELECT SOME_AGGREGATE_FUNCTION(some_column) FROM table
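In DataFrame terms, both of the calls below compute the same global aggregate over the whole table, returning a single row (a minimal sketch, assuming df has a user_id column; count is the aggregate function from org.apache.spark.sql.functions):

import org.apache.spark.sql.functions.count

// Equivalent: the aggregate is computed over the entire DataFrame,
// with no grouping key, so the result is a single row.
df.agg(count("user_id"))
df.groupBy().agg(count("user_id"))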

Regarding your second problem, it is hard to give a good answer without access to the data, but generally speaking these two queries are not equivalent. The first simply counts the number of distinct email values; the second counts the unique values of the last user_id taken per email. Moreover, last without an explicit ordering is meaningless, because the "last" row seen in each group depends on how the data happens to be partitioned and scheduled, which is why a single machine and a cluster give different results.
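If you want a deterministic "last" value, you have to define the ordering yourself, for example with a window function. The sketch below assumes a hypothetical ts column that says which row per email counts as the latest; it illustrates the idea rather than being a drop-in fix:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// Hypothetical ordering column "ts" defines what "last" means per email.
val w = Window.partitionBy("email").orderBy(col("ts").desc)

val lastUserPerEmail = df
  .withColumn("rn", row_number().over(w))
  .where(col("rn") === 1)   // keep only the latest row for each email
  .select("user_id")

(lastUserPerEmail.count, lastUserPerEmail.distinct.count)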

Upvotes: 1
