Reputation: 51
I'm trying to query data from a MySQL server and write it to Google BigQuery using pandas .to_gbq api.
import pandas as pd

def production_to_gbq(table_name_prod, prefix, table_name_gbq, dataset, project):
    # Extract data from production (con is an existing database connection)
    q = """
        SELECT *
        FROM {}
        """.format(table_name_prod)
    df = pd.read_sql(q, con)
    # Write to gbq
    df.to_gbq(dataset + table_name_gbq, project, chunksize=1000, verbose=True,
              reauth=False, if_exists='replace', private_key=None)
    return df
I keep getting a 400 error indicating invalid input.
Load is 100.0% Complete
---------------------------------------------------------------------------
BadRequest Traceback (most recent call last)
/usr/local/lib/python3.6/site-packages/pandas_gbq/gbq.py in load_data(self, dataframe, dataset_id, table_id, chunksize, schema)
569 self.client, dataframe, dataset_id, table_id,
--> 570 chunksize=chunksize):
571 self._print("\rLoad is {0}% Complete".format(
/usr/local/lib/python3.6/site-packages/pandas_gbq/_load.py in load_chunks(client, dataframe, dataset_id, table_id, chunksize, schema)
73 destination_table,
---> 74 job_config=job_config).result()
/usr/local/lib/python3.6/site-packages/google/cloud/bigquery/job.py in result(self, timeout)
527 # TODO: modify PollingFuture so it can pass a retry argument to done().
--> 528 return super(_AsyncJob, self).result(timeout=timeout)
529
/usr/local/lib/python3.6/site-packages/google/api_core/future/polling.py in result(self, timeout)
110 # Pylint doesn't recognize that this is valid in this case.
--> 111 raise self._exception
112
BadRequest: 400 Error while reading data, error message: CSV table encountered too many errors, giving up. Rows: 10; errors: 1. Please look into the error stream for more details.
During handling of the above exception, another exception occurred:
GenericGBQException Traceback (most recent call last)
<ipython-input-73-ef9c7cec0104> in <module>()
----> 1 departments.to_gbq(dataset + table_name_gbq, project, chunksize=1000, verbose=True, reauth=False, if_exists='replace', private_key=None)
2
/usr/local/lib/python3.6/site-packages/pandas/core/frame.py in to_gbq(self, destination_table, project_id, chunksize, verbose, reauth, if_exists, private_key)
1058 return gbq.to_gbq(self, destination_table, project_id=project_id,
1059 chunksize=chunksize, verbose=verbose, reauth=reauth,
--> 1060                              if_exists=if_exists, private_key=private_key)
1061
1062 @classmethod
/usr/local/lib/python3.6/site-packages/pandas/io/gbq.py in to_gbq(dataframe, destination_table, project_id, chunksize, verbose, reauth, if_exists, private_key)
107 chunksize=chunksize,
108 verbose=verbose, reauth=reauth,
--> 109 if_exists=if_exists, private_key=private_key)
/usr/local/lib/python3.6/site-packages/pandas_gbq/gbq.py in to_gbq(dataframe, destination_table, project_id, chunksize, verbose, reauth, if_exists, private_key, auth_local_webserver, table_schema)
980 connector.load_data(
981 dataframe, dataset_id, table_id, chunksize=chunksize,
--> 982 schema=table_schema)
983
984
/usr/local/lib/python3.6/site-packages/pandas_gbq/gbq.py in load_data(self, dataframe, dataset_id, table_id, chunksize, schema)
572 ((total_rows - remaining_rows) * 100) / total_rows))
573 except self.http_error as ex:
--> 574 self.process_http_error(ex)
575
576 self._print("\n")
/usr/local/lib/python3.6/site-packages/pandas_gbq/gbq.py in process_http_error(ex)
453 # <https://cloud.google.com/bigquery/troubleshooting-errors>`__
454
--> 455 raise GenericGBQException("Reason: {0}".format(ex))
456
457 def run_query(self, query, **kwargs):
GenericGBQException: Reason: 400 Error while reading data, error message: CSV table encountered too many errors, giving up. Rows: 10; errors: 1. Please look into the error stream for more details.
I've investigated the table schema,
id INTEGER NULLABLE
name STRING NULLABLE
description STRING NULLABLE
created_at INTEGER NULLABLE
modified_at FLOAT NULLABLE
and it is the same as the dataframe:
id int64
name object
description object
created_at int64
modified_at float64
The table is created in GBQ but remains empty.
I read around a little but haven't found much on the pandas to_gbq API, except for this question, which seemed relevant but has no reply:
bigquery table is empty when using pandas to_gbq
I found one potential solution about numbers in object-dtype columns being passed into the GBQ table without quotes, which was fixed by setting the column datatypes to string:
I use to_gbq on pandas for updating Google BigQuery and get GenericGBQException
I tried the fix:
for col in df.columns:
    if df[col].dtypes == object:
        df[col] = df[col].fillna('')
        df[col] = df[col].astype(str)
Unfortunately I still get the same error. Similarly, formatting the missing data and setting dtypes for the int and float columns also gives the same error.
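For reference, the missing-data and dtype handling I tried looked roughly like this (a sketch using the column names from the schema above, not the literal code):

# sketch: fill missing numeric values and pin the dtypes from the schema above
for col in df.select_dtypes(include=['float64']).columns:
    df[col] = df[col].fillna(0.0)
df['id'] = df['id'].astype('int64')
df['created_at'] = df['created_at'].astype('int64')
df['modified_at'] = df['modified_at'].astype('float64')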
Is there something I'm missing?
Upvotes: 5
Views: 10027
Reputation: 1
I had a similar issue because of an unwanted 'Unnamed: 0' column present in the dataset. I removed it and the problem was solved. Check the dataframe's shape and head to see whether any null or unwanted columns exist in the dataset, as in the sketch below.
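A minimal sketch of that check, assuming the stray column is the 'Unnamed: 0' index column pandas sometimes adds when reading a CSV:

print(df.shape)
print(df.head())
# drop the stray index column if it is present
df = df.drop(columns=['Unnamed: 0'], errors='ignore')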
Upvotes: 0
Reputation: 3514
I had some invalid characters in string columns (object dtype in pandas). I used @Echochi's approach and it worked:
import re

for col in list(parsed_data.select_dtypes(include='object').columns):
    parsed_data[col] = parsed_data[col].apply(lambda x: re.sub('[^A-Za-z0-9]+', '', str(x)))
That was a little too restrictive with the accepted characters, so I used a more general approach, given BigQuery's compatibility with UTF-8 (see the bigquery docs):
for col in list(parsed_data.select_dtypes(include='object').columns):
    parsed_data[col] = parsed_data[col].apply(lambda x: re.sub(r"[^\u0900-\u097F]+", '?', str(x)))
with r"[^\u0900-\u097F]+"
you will accept all UTF-8
compatible charset
Upvotes: 1
Reputation: 23
I have ended up here several times when I've had a similar issue with importing to bigquery from parquet files on cloud storage. However, each time I've forgotten the way to solve it, so I hope it's not too much of a breach of protocol to leave my findings here!
What I realised was that I had columns that were entirely NULL. They look like they have a normal data type in pandas, but if you use pyarrow.parquet.read_schema(parquet_file), you will see that their data type is null.
After removing the column, the upload will work!
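A rough sketch of that schema check with pyarrow (the file name here is a placeholder):

import pyarrow.parquet as pq
import pyarrow.types as pat

# read only the Parquet schema, without loading the data
schema = pq.read_schema("departments.parquet")

# columns whose Parquet type is "null" are the ones that break the BigQuery load
null_cols = [field.name for field in schema if pat.is_null(field.type)]
print(null_cols)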
Upvotes: 1
Reputation: 183
Found that BigQuery cannot properly handle \r (and sometimes \n too). I had the same issue, localized the problem, and was really surprised that just replacing \r with a space fixed it:
for col in list(df.columns):
    # on Python 3 every text value is str, so a single isinstance check suffices
    df[col] = df[col].apply(lambda x: x.replace('\r', ' ') if isinstance(x, str) else x)
Upvotes: 2