VMEscoli

Reputation: 1232

Reading an EXCEL file from bucket into Bigquery

I'm trying to load this data from my bucket into BigQuery, but it's complaining. My file is an Excel file.

ID          A   B   C          D             E      F           Value1  Value2
333344      ALG A   RETAIL     OTHER         YIPP   Jun 2019    123     4
34563       ALG A   NON-RETAIL OTHER         TATS   Mar 2019    124     0
7777777777  -   E   RETAIL     NASAL         KHPO   Jul 2020    1,448   0
7777777777  -   E   RETAIL     SEVERE ASTHMA PZIFER Oct 2019    1,493   162

From python I call the file as follow:

    from google.cloud import bigquery
    
    # Construct a BigQuery client object.
    client = bigquery.Client()
    
    table_id = "project.dataset.my_table"
    
    job_config = bigquery.LoadJobConfig(
        schema=[
            bigquery.SchemaField('ID', 'STRING'),
            bigquery.SchemaField('A', 'STRING'),
            bigquery.SchemaField('B', 'STRING'),
            bigquery.SchemaField('C', 'STRING'),
            bigquery.SchemaField('D', 'STRING'),
            bigquery.SchemaField('E', 'STRING'),
            bigquery.SchemaField('F', 'STRING'),
            bigquery.SchemaField('Value1', 'STRING'),
            bigquery.SchemaField('Value2', 'STRING'),
        ],
        skip_leading_rows=1,
    )
    
    uri = "gs://bucket/folder/file.xlsx"
    
    load_job = client.load_table_from_uri(
        uri, table_id, job_config=job_config
    )  # Make an API request.
    
    load_job.result()  # Wait for the job to complete.
    
    table = client.get_table(table_id)
    print("Loaded {} rows to table {}".format(table.num_rows, table_id))

I am getting the following error, and it's complaining about a line that isn't even there.

BadRequest: 400 Error while reading data, error message: CSV table references column position 8, but line starting at position:660 contains only 1 columns.

I thought the problem was the data types, since I had originally declared ID, Value1 and Value2 as INTEGER and F as TIMESTAMP, so now I'm trying everything as STRING, and I still get the error.

My file is only 4 lines long in this test I'm doing.

Upvotes: 1

Views: 1949

Answers (3)

Cylldby

Reputation: 1978

Excel files are not supported by BigQuery.

A few workaround solutions:

  • Upload a CSV version of your file into your bucket (a simple bq load command will do, cf here),
  • Read the Excel file with Pandas in your python script and insert the rows in BQ with the to_gbq() function,
  • Upload your Excel file in your Google Drive, make a spreadsheet version out of it and make an external table linked to that spreadsheet.

Upvotes: 2

takryo

Reputation: 439

Your input file looks like a TSV (tab-separated values) file, so you need to set the field delimiter to '\t', like this:

job_config = bigquery.LoadJobConfig(
  schema=[
    bigquery.SchemaField('ID','STRING'),
    bigquery.SchemaField('A','STRING'),
    bigquery.SchemaField('B','STRING'),
    bigquery.SchemaField('C','STRING'),
    bigquery.SchemaField('D','STRING'),
    bigquery.SchemaField('E','STRING'),
    bigquery.SchemaField('F','STRING'),
    bigquery.SchemaField('Value1','STRING'),
    bigquery.SchemaField('Value2','STRING'),
  ],
  skip_leading_rows=1,
  field_delimiter='\t'
)

Upvotes: 0

Sergey Geron

Reputation: 10172

Try specifying field_delimiter in the LoadJobConfig.

Upvotes: 0
