Reputation: 2028
Given the following code:
try:
    dest_table = bigquery.table.Table(table_id, schema)
    job = self.client.load_table_from_dataframe(
        df_data,  # pd.DataFrame
        dest_table,
        job_config=bigquery.job.LoadJobConfig(schema=schema)
    )
    job.result()
except TypeError:
    with pd.option_context("display.max_rows", None, "display.max_columns", None, "display.width", None):
        LOG.error("Failed to upload dataframe: \n\n%s\n", df_data.to_csv(header=True, index=False, quoting=csv.QUOTE_NONNUMERIC))
        LOG.error("\n%s\n", df_data.dtypes)
        if schema:
            LOG.error(
                "schema: \n\n%s",
                ('[\n  ' + ',\n  '.join(json.dumps(field) for field in schema) + '\n]\n')
            )
        LOG.error(f"dest_table_id: {dest_table_id}")
    raise
BigQuery's Client.load_table_from_dataframe raises from pyarrow with:
dags/utils/database/_bigquery.py:257: in load_file_to_table
job = self.client.load_table_from_dataframe(
lib/python3.8/site-packages/google/cloud/bigquery/client.py:2233: in load_table_from_dataframe
_pandas_helpers.dataframe_to_parquet(
lib/python3.8/site-packages/google/cloud/bigquery/_pandas_helpers.py:486: in dataframe_to_parquet
arrow_table = dataframe_to_arrow(dataframe, bq_schema)
lib/python3.8/site-packages/google/cloud/bigquery/_pandas_helpers.py:450: in dataframe_to_arrow
bq_to_arrow_array(get_column_or_index(dataframe, bq_field.name), bq_field)
lib/python3.8/site-packages/google/cloud/bigquery/_pandas_helpers.py:224: in bq_to_arrow_array
return pyarrow.Array.from_pandas(series, type=arrow_type)
pyarrow/array.pxi:859: in pyarrow.lib.Array.from_pandas
???
pyarrow/array.pxi:265: in pyarrow.lib.array
???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> ???
E TypeError: an integer is required (got type str)
Yet, when investigating with the ERROR debug prints, the provided DataFrame seems to match the expected schema for every field of the table:
ERROR utils.database._bigquery:_bigquery.py:265 Failed to upload dataframe:
"clinic_key","schedule_template_time_interval_key","schedule_template_key","date_key","schedule_owner_key","schedule_template_schedule_track_key","schedule_content_label_key","start_time_key","end_time_key","priority"
"clitest11111111111111111111111","1","1","2021-01-01","1","1","1","19:00:00","21:00:00",1
"clitest11111111111111111111111","1","1","2021-01-01","1","1","2","20:00:00","20:30:00",2
"clitest11111111111111111111111","1","1","2021-01-01","1","1","3","20:20:00","20:30:00",3
ERROR utils.database._bigquery:_bigquery.py:266
clinic_key object
schedule_template_time_interval_key object
schedule_template_key object
date_key object
schedule_owner_key object
schedule_template_schedule_track_key object
schedule_content_label_key object
start_time_key object
end_time_key object
priority int64
dtype: object
ERROR utils.database._bigquery:_bigquery.py:268 schema:
[
{"name": "clinic_key", "type": "STRING", "mode": "NULLABLE"},
{"name": "schedule_template_time_interval_key", "type": "STRING", "mode": "NULLABLE"},
{"name": "schedule_template_key", "type": "STRING", "mode": "NULLABLE"},
{"name": "date_key", "type": "DATE", "mode": "NULLABLE"},
{"name": "schedule_owner_key", "type": "STRING", "mode": "NULLABLE"},
{"name": "schedule_template_schedule_track_key", "type": "STRING", "mode": "NULLABLE"},
{"name": "schedule_content_label_key", "type": "STRING", "mode": "NULLABLE"},
{"name": "start_time_key", "type": "TIME", "mode": "NULLABLE"},
{"name": "end_time_key", "type": "TIME", "mode": "NULLABLE"},
{"name": "priority", "type": "INT64", "mode": "NULLABLE"}
]
I tried converting the (start_time_key, end_time_key) fields to INT (number of seconds since the start of the day) before calling load_table_from_dataframe, but it did not solve the issue. Beyond that I am stumped; I don't understand which field is supposed to be an integer but is a string instead.
How can I solve this?
P.S.: I attempted a different approach to this, using load_file_to_table, but I faced another issue. Here's the link to the other issue.
Upvotes: 1
Views: 1887
Reputation: 1552
Since you are trying to load data from a DataFrame to BigQuery, I tried to replicate your use case on my end by referring to this document, using the Python client library. The example provided by Google works as expected, but when I tried it with the schema you provided, I got the same error, i.e. "TypeError: an integer is required (got type str)".
This error occurs because of a mismatch between the data types declared in your schema and the values provided in the corresponding columns, i.e.
"date_key", "type": "DATE"
"start_time_key", "type": "TIME"
"end_time_key", "type": "TIME"
For these columns you declared the types DATE and TIME respectively, but you are passing the values as strings, which is what raises the TypeError, i.e.
"date_key": "2021-01-01", "start_time_key": "19:00:00", "end_time_key": "21:00:00"
Alternatively, you can refer to the code snippet and DataFrame below, which I replicated on my end and which work as expected.
loaddata.py:
import datetime

from google.cloud import bigquery
import pandas

# Construct a BigQuery client object.
client = bigquery.Client()

# TODO(developer): Set table_id to the ID of the table to create.
table_id = "myproject.dataset1.tab4"

records = [
    {
        "clinic_key": u"cli101",
        "schedule_template_time_interval_key": "1",
        "schedule_template_key": "1",
        "date_key": datetime.date(2021, 10, 5),
        "schedule_owner_key": "1",
        "schedule_template_schedule_track_key": "1",
        "schedule_content_label_key": "1",
        "start_time_key": datetime.time(19, 5, 0),
        "end_time_key": datetime.time(20, 0, 0),
        "priority": 1,
    },
    {
        "clinic_key": u"cli102",
        "schedule_template_time_interval_key": "2",
        "schedule_template_key": "2",
        "date_key": datetime.date(2021, 10, 6),
        "schedule_owner_key": "2",
        "schedule_template_schedule_track_key": "2",
        "schedule_content_label_key": "2",
        "start_time_key": datetime.time(16, 10, 0),
        "end_time_key": datetime.time(16, 50, 0),
        "priority": 2,
    },
    {
        "clinic_key": u"cli103",
        "schedule_template_time_interval_key": "3",
        "schedule_template_key": "3",
        "date_key": datetime.date(2021, 10, 7),
        "schedule_owner_key": "3",
        "schedule_template_schedule_track_key": "3",
        "schedule_content_label_key": "3",
        "start_time_key": datetime.time(19, 10, 0),
        "end_time_key": datetime.time(20, 0, 0),
        "priority": 1,
    },
    {
        "clinic_key": u"cli104",
        "schedule_template_time_interval_key": "4",
        "schedule_template_key": "4",
        "date_key": datetime.date(2021, 10, 8),
        "schedule_owner_key": "4",
        "schedule_template_schedule_track_key": "4",
        "schedule_content_label_key": "4",
        "start_time_key": datetime.time(20, 40, 0),
        "end_time_key": datetime.time(21, 15, 0),
        "priority": 3,
    },
]

dataframe = pandas.DataFrame(
    records,
    # In the loaded table, the column order reflects the order of the
    # columns in the DataFrame.
    columns=[
        "clinic_key",
        "schedule_template_time_interval_key",
        "schedule_template_key",
        "date_key",
        "schedule_owner_key",
        "schedule_template_schedule_track_key",
        "schedule_content_label_key",
        "start_time_key",
        "end_time_key",
        "priority",
    ],
    # Optionally, set a named index, which can also be written to the
    # BigQuery table.
)

job_config = bigquery.LoadJobConfig(
    # Specify a (partial) schema. All columns are always written to the
    # table. The schema is used to assist in data type definitions.
    schema=[
        # Specify the type of columns whose type cannot be auto-detected.
        # For example, the string columns use pandas dtype "object", so
        # their data type is ambiguous.
        bigquery.SchemaField("clinic_key", "STRING"),
        bigquery.SchemaField("schedule_template_time_interval_key", "STRING"),
        bigquery.SchemaField("schedule_template_key", "STRING"),
        bigquery.SchemaField("schedule_owner_key", "STRING"),
        bigquery.SchemaField("schedule_template_schedule_track_key", "STRING"),
        bigquery.SchemaField("schedule_content_label_key", "STRING"),
        bigquery.SchemaField("priority", "INTEGER"),
    ],
    write_disposition="WRITE_TRUNCATE",
)

job = client.load_table_from_dataframe(
    dataframe, table_id, job_config=job_config
)  # Make an API request.
job.result()  # Wait for the job to complete.

table = client.get_table(table_id)  # Make an API request.
print(
    "Loaded {} rows and {} columns to {}".format(
        table.num_rows, len(table.schema), table_id
    )
)
Sample record from the DataFrame:
{
    "clinic_key": u"cli101",
    "schedule_template_time_interval_key": "1",
    "schedule_template_key": "1",
    "date_key": datetime.date(2021, 10, 5),
    "schedule_owner_key": "1",
    "schedule_template_schedule_track_key": "1",
    "schedule_content_label_key": "1",
    "start_time_key": datetime.time(19, 5, 0),
    "end_time_key": datetime.time(20, 0, 0),
    "priority": 1,
}
When you use DATE and TIME in the schema, the values in those columns must be passed as datetime.date(2021, 2, 21) and datetime.time(16, 0, 0) objects respectively.
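As a quick sanity check (an illustrative snippet, not part of the load itself), you can inspect the actual Python type of a cell; a pandas object-dtype column prints the same in dataframe.dtypes whether it holds strings or datetime objects, which is why the dtypes output in your question looked correct:

# An object-dtype column hides the difference between str and datetime
# objects, so check a cell's type directly before loading.
print(type(dataframe.loc[0, "date_key"]))        # expect <class 'datetime.date'>
print(type(dataframe.loc[0, "start_time_key"]))  # expect <class 'datetime.time'>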
Schema output: (screenshot)
Query output: (screenshot)
Upvotes: 2