lourdu rajan
lourdu rajan

Reputation: 379

BeamSQL Group By query problem with Float value

Tried to get the unique value from the BigQuery table using BeamSQL in Google Dataflow. Using Group By clause implemented the condition in BeamSQL (sample query below). One of the column has float data type. While executing the Job got below exceptions,

Caused by: org.apache.beam.sdk.coders.Coder$NonDeterministicException: org.apache.beam.sdk.coders.RowCoder@81d6d10 is not deterministic because: All fields must have deterministic encoding. Caused by: org.apache.beam.sdk.coders.Coder$NonDeterministicException: FloatCoder is not deterministic because: Floating point encodings are not guaranteed to be deterministic.

BeamSQL Query:

PCollection ST= mainColl.apply(SqlTransform.query("SELECT ID,ITEM,UNITPRICE FROM PCOLLECTION GROUP BY ID,ITEM,UNITPRICE"));

It would be good if some one help me to solve this issue.

Please note, if we remove the float column then BeamSQL query works fine.

Upvotes: 1

Views: 633

Answers (1)

saifuddin778
saifuddin778

Reputation: 7277

This is indicating that you should not use floating point values (in this case probably UNITPRICE value) in the aggregation (group by) scheme, because their output is non-deterministic (i.e. it can change based on the precision change). For instance, consider this example:

WITH
  data AS (
  SELECT 100 AS id, 'abc' as item, 0.3448473362800000001 AS unitprice
  UNION ALL
  SELECT 200 AS id, 'xyz' as item, 0.49300013 AS unitprice
  UNION ALL
  SELECT 500 AS id, 'pqr' as item, 0.67322332200000212 AS unitprice
)
select id, item, unitprice from data
group by id, item, unitprice

Output for this comes to be:

100 abc 0.34484733628    
200 xyz 0.49300013   
500 pqr 0.6732233220000021

in which, the unitprice values look a bit different.

To avoid this, you can go two routes:

  • You can cast your unitprice as a string and then continue with grouping. Something like cast(unitprice as string) as unitprice in your query.
  • You can simply choose to keep unitprice as non-grouped entity (which is a logical option in most cases), and just do max(unitprice) as unitprice or avg(unitprice) as unitprice in your query, while grouping by id, item.

Hope this helps.

Upvotes: 3

Related Questions