Philippe Hebert

Reputation: 2028

googleapis / python-bigquery: Client.load_table_from_dataframe fails with PyArrow "TypeError: an integer is required (got type str)"

Given the following code:

try:
    dest_table = bigquery.table.Table(table_id, schema)
    job = self.client.load_table_from_dataframe(
        df_data, # pd.DataFrame
        dest_table,
        job_config=bigquery.job.LoadJobConfig(schema=schema)
    )
    job.result()
except TypeError:
    with pd.option_context("display.max_rows", None, "display.max_columns", None, "display.width", None):
        LOG.error("Failed to upload dataframe: \n\n%s\n", df_data.to_csv(header=True, index=False, quoting=csv.QUOTE_NONNUMERIC))
        LOG.error("\n%s\n", df_data.dtypes)
        if schema:
            LOG.error(
                "schema: \n\n%s", 
                ('[\n  ' + ',\n  '.join(json.dumps(field) for field in schema) + '\n]\n')
            )
        LOG.error(f"table_id: {table_id}")
    raise

BigQuery's Client.load_table_from_dataframe raises from pyarrow with:

dags/utils/database/_bigquery.py:257: in load_file_to_table
    job = self.client.load_table_from_dataframe(
lib/python3.8/site-packages/google/cloud/bigquery/client.py:2233: in load_table_from_dataframe
    _pandas_helpers.dataframe_to_parquet(
lib/python3.8/site-packages/google/cloud/bigquery/_pandas_helpers.py:486: in dataframe_to_parquet
    arrow_table = dataframe_to_arrow(dataframe, bq_schema)
lib/python3.8/site-packages/google/cloud/bigquery/_pandas_helpers.py:450: in dataframe_to_arrow
    bq_to_arrow_array(get_column_or_index(dataframe, bq_field.name), bq_field)
lib/python3.8/site-packages/google/cloud/bigquery/_pandas_helpers.py:224: in bq_to_arrow_array
    return pyarrow.Array.from_pandas(series, type=arrow_type)
pyarrow/array.pxi:859: in pyarrow.lib.Array.from_pandas
    ???
pyarrow/array.pxi:265: in pyarrow.lib.array
    ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

>   ???
E   TypeError: an integer is required (got type str)

Yet, when investigating with the ERROR debug prints, the provided DataFrame appears to match the expected schema for every field of the table:

ERROR    utils.database._bigquery:_bigquery.py:265 Failed to upload dataframe:

"clinic_key","schedule_template_time_interval_key","schedule_template_key","date_key","schedule_owner_key","schedule_template_schedule_track_key","schedule_content_label_key","start_time_key","end_time_key","priority"
"clitest11111111111111111111111","1","1","2021-01-01","1","1","1","19:00:00","21:00:00",1
"clitest11111111111111111111111","1","1","2021-01-01","1","1","2","20:00:00","20:30:00",2
"clitest11111111111111111111111","1","1","2021-01-01","1","1","3","20:20:00","20:30:00",3

ERROR    utils.database._bigquery:_bigquery.py:266
clinic_key                              object
schedule_template_time_interval_key     object
schedule_template_key                   object
date_key                                object
schedule_owner_key                      object
schedule_template_schedule_track_key    object
schedule_content_label_key              object
start_time_key                          object
end_time_key                            object
priority                                 int64
dtype: object

ERROR    utils.database._bigquery:_bigquery.py:268 schema:

[
  {"name": "clinic_key", "type": "STRING", "mode": "NULLABLE"},
  {"name": "schedule_template_time_interval_key", "type": "STRING", "mode": "NULLABLE"},
  {"name": "schedule_template_key", "type": "STRING", "mode": "NULLABLE"},
  {"name": "date_key", "type": "DATE", "mode": "NULLABLE"},
  {"name": "schedule_owner_key", "type": "STRING", "mode": "NULLABLE"},
  {"name": "schedule_template_schedule_track_key", "type": "STRING", "mode": "NULLABLE"},
  {"name": "schedule_content_label_key", "type": "STRING", "mode": "NULLABLE"},
  {"name": "start_time_key", "type": "TIME", "mode": "NULLABLE"},
  {"name": "end_time_key", "type": "TIME", "mode": "NULLABLE"},
  {"name": "priority", "type": "INT64", "mode": "NULLABLE"}
]

I tried converting the (start_time_key, end_time_key) fields to INT (number of seconds since start of day) before calling load_table_from_dataframe, but it did not solve the issue. Besides that, I am stumped; I don't understand which field is supposed to be an integer but is a string instead.

How can I solve this?

P.S.: I attempted a different approach to this, using load_file_to_table, but I faced another issue. Here's the link to the other issue.

Upvotes: 1

Views: 1887

Answers (1)

Sandeep Mohanty

Reputation: 1552

Since you are loading data from a DataFrame into BigQuery, I tried to replicate your use case on my end by following this document, which uses the Python client library.

The example provided by Google works as expected. When I tried it with the schema you provided, I got the same error, i.e. "TypeError: an integer is required (got type str)".

This error occurs because of a mismatch between the data types declared in your schema and the values passed in the corresponding columns, i.e.

"date_key", "type": "DATE"
"start_time_key", "type": "TIME"
"end_time_key", "type": "TIME"

For these columns the schema declares DATE and TIME respectively, but the values are passed as strings, which is what triggers the TypeError, i.e.

"date_key": "2021-01-01", "start_time_key": "19:00:00", "end_time_key": "21:00:00"

You can refer to the code snippet and the DataFrame below. I replicated this on my end and it works.

loaddata.py:

import datetime

from google.cloud import bigquery
import pandas

# Construct a BigQuery client object.
client = bigquery.Client()

# TODO(developer): Set table_id to the ID of the table to create.
table_id = "myproject.dataset1.tab4"

records = [
   {
      "clinic_key": u"cli101",
      "schedule_template_time_interval_key":"1",
      "schedule_template_key":"1",
      "date_key":datetime.date(2021,10,5),
      "schedule_owner_key":"1",
      "schedule_template_schedule_track_key":"1",
      "schedule_content_label_key":"1",
      "start_time_key":datetime.time(19,5,00),
      "end_time_key":datetime.time(20,00,00),
      "priority":1,

   },
   {
      "clinic_key": u"cli102",
      "schedule_template_time_interval_key":"2",
      "schedule_template_key":"2",
      "date_key":datetime.date(2021,10,6),
      "schedule_owner_key":"2",
      "schedule_template_schedule_track_key":"2",
      "schedule_content_label_key":"2",
      "start_time_key":datetime.time(16,10,00),
      "end_time_key":datetime.time(16,50,00),
      "priority":2,

   },
   {
      "clinic_key": u"cli103",
      "schedule_template_time_interval_key":"3",
      "schedule_template_key":"3",
      "date_key":datetime.date(2021,10,7),
      "schedule_owner_key":"3",
      "schedule_template_schedule_track_key":"3",
      "schedule_content_label_key":"3",
      "start_time_key":datetime.time(19,10,00),
      "end_time_key":datetime.time(20,00,00),
      "priority":1,

   },
   {
      "clinic_key": u"cli104",
      "schedule_template_time_interval_key":"4",
      "schedule_template_key":"4",
      "date_key":datetime.date(2021,10,8),
      "schedule_owner_key":"4",
      "schedule_template_schedule_track_key":"4",
      "schedule_content_label_key":"4",
      "start_time_key":datetime.time(20,40,00),
      "end_time_key":datetime.time(21,15,00),
      "priority":3,

   },
]
dataframe = pandas.DataFrame(
   records,
   # In the loaded table, the column order reflects the order of the
   # columns in the DataFrame.
   columns=[
       "clinic_key",
       "schedule_template_time_interval_key",
       "schedule_template_key",
       "date_key",
       "schedule_owner_key",
       "schedule_template_schedule_track_key",
       "schedule_content_label_key",
       "start_time_key",
       "end_time_key",
       "priority",
   ],
   # Optionally, set a named index, which can also be written to the
   # BigQuery table.
  
)
job_config = bigquery.LoadJobConfig(
   # Specify a (partial) schema. All columns are always written to the
   # table. The schema is used to assist in data type definitions.
   schema=[
       # Specify the type of columns whose type cannot be auto-detected. For
       # example, the string columns use pandas dtype "object", so their
       # data type is ambiguous.
       bigquery.SchemaField("clinic_key", "STRING"),
       bigquery.SchemaField("schedule_template_time_interval_key","STRING"),
       bigquery.SchemaField("schedule_template_key","STRING"),
       bigquery.SchemaField("schedule_owner_key","STRING"),
       bigquery.SchemaField("schedule_template_schedule_track_key","STRING"),
       bigquery.SchemaField("schedule_content_label_key","STRING"),
       bigquery.SchemaField("priority", "INTEGER"),
   ],
   write_disposition="WRITE_TRUNCATE",
)

job = client.load_table_from_dataframe(
   dataframe, table_id, job_config=job_config
)  # Make an API request.
job.result()  # Wait for the job to complete.

table = client.get_table(table_id)  # Make an API request.
print(
   "Loaded {} rows and {} columns to {}".format(
       table.num_rows, len(table.schema), table_id
   )
)


When you use DATE and TIME in the schema, the values in those columns must be passed as datetime.date(2021, 2, 21) and datetime.time(16, 0, 0) objects respectively, not as strings.
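If you would rather keep building the DataFrame from strings (as in the question's CSV dump) and convert afterwards, here is a minimal sketch of that conversion, assuming the string-typed `date_key`, `start_time_key`, and `end_time_key` columns shown in the question's dtypes output:

```python
import pandas as pd

# Stand-in for the question's df_data: DATE/TIME columns arrive as strings.
df_data = pd.DataFrame({
    "date_key": ["2021-01-01"],
    "start_time_key": ["19:00:00"],
    "end_time_key": ["21:00:00"],
})

# DATE column: parse to Timestamp, then keep only the date component.
df_data["date_key"] = pd.to_datetime(df_data["date_key"]).dt.date

# TIME columns: parse with an explicit format, then keep only the time component.
for col in ("start_time_key", "end_time_key"):
    df_data[col] = pd.to_datetime(df_data[col], format="%H:%M:%S").dt.time

print(df_data.iloc[0].tolist())
# [datetime.date(2021, 1, 1), datetime.time(19, 0), datetime.time(21, 0)]
```

After this, the columns hold datetime.date and datetime.time objects, which load_table_from_dataframe can serialize to DATE and TIME without the pyarrow TypeError.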

Schema Output:

[screenshot of the resulting table schema omitted]

Query Output:

[screenshot of the query results omitted]

Upvotes: 2
