sanjay

Reputation: 354

Google BigQuery schema inference is incorrect

When I load data in CSV format, such as the row below:

G000021318, 17.0, New, 0.0, None, jan, 2010

BigQuery strips the leading G0000 from the first field and converts it to an INTEGER.

The code that configures the load is as follows:

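    // loadConfig is created earlier (not shown); presumably a JobConfigurationLoad.
    // Source files: every part file under the account's folder in the bucket.
    List<String> sources = new ArrayList<String>();
    sources.add("gs://" + googleBucket + "/" + accountId + "/" + sourceFile + "_" + account.getSuffix() + "/part*");
    loadConfig.setSourceUris(sources);

    // Destination table: one table per account suffix.
    TableReference tableRef = new TableReference();
    tableRef.setProjectId(googleProjectId);
    tableRef.setDatasetId(datasetId);
    tableRef.setTableId(flagVolumeMonthTable + "_" + account.getSuffix());
    loadConfig.setDestinationTable(tableRef);

    // CSV options: comma-delimited, with the schema inferred by BigQuery.
    loadConfig.setFieldDelimiter(",");
    loadConfig.setAutodetect(true);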

Am I missing something, or is this a bug in BigQuery's schema detection?

Upvotes: 2

Views: 1685

Answers (1)

Mosha Pasumansky

Reputation: 14014

The problem happened because BigQuery's autodetect code decided, based on the G prefix, that G000021318 was an ISO-compliant format for the Haitian gourde currency, and eagerly interpreted the data as an INT64 representing 21318 gourdes :) We have fixed the autodetect code to react only to unambiguous currency symbols such as $, €, £, ¥, ¢, etc.

P.S. The fix will propagate into production systems within weeks.
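To make the difference between the two behaviours concrete, here is a minimal sketch in plain Java. It is not BigQuery's actual autodetect code: the class name `AutodetectSketch`, the methods `eagerType`, `eagerValue` and `strictType`, and both rules are hypothetical and only mimic what is described above (an eager rule that accepts a leading letter as a currency marker, and a stricter rule that accepts only the unambiguous symbols listed).

```java
// Hypothetical illustration only; none of this is BigQuery's real implementation.
final class AutodetectSketch {

    // Currency symbols treated as unambiguous: dollar, cent (U+00A2),
    // pound (U+00A3), yen (U+00A5), euro (U+20AC).
    private static final String SYMBOLS = "$\u00A2\u00A3\u00A5\u20AC";

    /** Eager rule: one leading letter or currency symbol followed by digits counts as an amount. */
    static String eagerType(String field) {
        String v = field.trim();
        if (v.matches("\\d+")) return "INTEGER";
        if (v.matches("[A-Za-z" + SYMBOLS + "]\\d+")) return "INTEGER"; // "G" is accepted here
        return "STRING";
    }

    /** Value the eager rule would store: the marker is dropped and the digits are parsed. */
    static long eagerValue(String field) {
        return Long.parseLong(field.trim().substring(1));
    }

    /** Stricter rule: only an optional unambiguous currency symbol may precede the digits. */
    static String strictType(String field) {
        String v = field.trim();
        if (v.matches("[" + SYMBOLS + "]?\\d+")) return "INTEGER";
        return "STRING";
    }

    public static void main(String[] args) {
        String id = "G000021318";
        System.out.println(eagerType(id));   // INTEGER
        System.out.println(eagerValue(id));  // 21318
        System.out.println(strictType(id));  // STRING
    }
}
```

Under the stricter rule, the identifier from the question stays a STRING, while a value such as `$17` is still recognised as numeric.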

Upvotes: 4
