Reputation: 701
I am trying to load a CSV file to BigQuery using a python script modelled on the python sample code here: https://developers.google.com/bigquery/docs/developers_guide
But I'm running into the following error when I try to load a table with the REST API:
{'status': '200', 'content-length': '1492', 'expires': 'Fri, 01 Jan 1990 00:00:00 GMT', 'server': 'HTTP Upload Server Built on Jun 14 2012 02:12:09 (1339665129)', 'etag': '"tcivyOj9QvKAbuEJ5MEMf9we85w/-mxYhUDjvvydxcebR8fXI6l_5RQ"', 'pragma': 'no-cache', 'cache-control': 'no-cache, no-store, must-revalidate', 'date': 'Fri, 06 Jul 2012 22:30:55 GMT', 'content-type': 'application/json'}
{
  "kind": "bigquery#job",
  "etag": "\"tcivyOj9QvKAbuEJ5MEMf9we85w/-mxYhUDjvvydxcebR8fXI6l_5RQ\"",
  "id": "firespotter.com:firespotter:job_d6b99265278b4c0da9c3033acf39d6b2",
  "selfLink": "https://www.googleapis.com/bigquery/v2/projects/firespotter.com:firespotter/jobs/job_d6b99265278b4c0da9c3033acf39d6b2",
  "jobReference": {
    "projectId": "firespotter.com:firespotter",
    "jobId": "job_d6b99265278b4c0da9c3033acf39d6b2"
  },
  "configuration": {
    "load": {
      "schema": {
        "fields": [
          {
            "name": "date",
            "type": "STRING"
          },
          {
            "name": "time",
            "type": "STRING"
          },
          {
            "name": "call_uuid",
            "type": "STRING"
          },
          {
            "name": "log_level",
            "type": "STRING"
          },
          {
            "name": "file_line",
            "type": "STRING"
          },
          {
            "name": "message",
            "type": "STRING"
          }
        ]
      },
      "destinationTable": {
        "projectId": "385479794093",
        "datasetId": "telephony_logs",
        "tableId": "table_name"
      },
      "createDisposition": "CREATE_IF_NEEDED",
      "writeDisposition": "WRITE_TRUNCATE",
      "encoding": "UTF-8"
    }
  },
  "status": {
    "state": "DONE",
    "errorResult": {
      "reason": "notFound",
      "message": "Not Found: Dataset 385479794093:telephony_logs"
    },
    "errors": [
      {
        "reason": "notFound",
        "message": "Not Found: Dataset 385479794093:telephony_logs"
      }
    ]
  }
}
The projectId listed in the error, "385479794093", is not the projectId that I pass in; it's the "project number". The projectId should be "firespotter.com:firespotter", as the dataset listing confirms:
{
  "kind": "bigquery#datasetList",
  "etag": "\"tcivyOj9QvKAbuEJ5MEMf9we85w/ZMa8z6LKMgWZIqLWh3ti2SsSs4g\"",
  "datasets": [
    {
      "kind": "bigquery#dataset",
      "id": "firespotter.com:firespotter:telephony_logs",
      "datasetReference": {
        "datasetId": "telephony_logs",
        "projectId": "firespotter.com:firespotter"
      }
    }
  ]
}
Why does the REST API insist on supplying its own incorrect projectId, when I pass the correct value in three different places? Is there another place where I need to pass in or set the Project ID?
For reference, here is the relevant code snippet:
import sys
import httplib2
from oauth2client.client import OAuth2WebServerFlow
from oauth2client.file import Storage
from oauth2client.tools import run

PROJECT = 'firespotter.com:firespotter'
DATASET = 'telephony_logs'
FLOW = OAuth2WebServerFlow(
    client_id='385479794093.apps.googleusercontent.com',
    client_secret='<a_secret_here>',
    scope='https://www.googleapis.com/auth/bigquery',
    user_agent='firespotter-upload-script/1.0')
def loadTable(http, projectId, datasetId, tableId, file_path, replace=False):
    url = "https://www.googleapis.com/upload/bigquery/v2/projects/" + projectId + "/jobs"

    # Create the body of the request, separated by a boundary of xxx
    mime_data = ('--xxx\n' +
                 'Content-Type: application/json; charset=UTF-8\n' + '\n' +
                 '{\n' +
                 '  "projectId": "' + projectId + '",\n' +
                 '  "configuration": {\n' +
                 '    "load": {\n' +
                 '      "schema": {\n' +
                 '        "fields": [\n' +
                 '          {"name":"date", "type":"STRING"},\n' +
                 '          {"name":"time", "type":"STRING"},\n' +
                 '          {"name":"call_uuid", "type":"STRING"},\n' +
                 '          {"name":"log_level", "type":"STRING"},\n' +
                 '          {"name":"file_line", "type":"STRING"},\n' +
                 '          {"name":"message", "type":"STRING"}\n' +
                 '        ]\n' +
                 '      },\n' +
                 '      "destinationTable": {\n' +
                 '        "projectId": "' + projectId + '",\n' +
                 '        "datasetId": "' + datasetId + '",\n' +
                 '        "tableId": "' + tableId + '"\n' +
                 '      },\n' +
                 '      "createDisposition": "CREATE_IF_NEEDED",\n' +
                 '      "writeDisposition": "' + ('WRITE_TRUNCATE' if replace else 'WRITE_APPEND') + '",\n' +
                 '      "encoding": "UTF-8"\n' +
                 '    }\n' +
                 '  }\n' +
                 '}\n' +
                 '--xxx\n' +
                 'Content-Type: application/octet-stream\n' +
                 '\n')

    # Append data from the specified file to the request body
    f = open(file_path, 'r')
    header_line = f.readline()  # skip header line
    mime_data += f.read()

    # Signify the end of the body
    mime_data += ('--xxx--\n')

    headers = {'Content-Type': 'multipart/related; boundary=xxx'}
    resp, content = http.request(url, method="POST", body=mime_data, headers=headers)

    print str(resp) + "\n"
    print content
# --- Main ----------------------------------------------
def main(argv):
    csv_path = argv[1]  # path to the CSV file to load

    # If the credentials don't exist or are invalid, run the native client
    # auth flow. The Storage object will ensure that if successful the good
    # credentials will get written back to a file.
    storage = Storage('bigquery2_credentials.dat')  # Choose a file name to store the credentials.
    credentials = storage.get()
    if credentials is None or credentials.invalid:
        credentials = run(FLOW, storage)

    # Create an httplib2.Http object to handle our HTTP requests and authorize it
    # with our good credentials.
    http = httplib2.Http()
    http = credentials.authorize(http)

    loadTable(http, PROJECT, DATASET, 'table_name', csv_path, replace=True)

if __name__ == '__main__':
    main(sys.argv)
Upvotes: 0
Views: 2928
Reputation: 26617
Did you recently set the project id to firespotter.com:firespotter? If the dataset was created before the project was named, there will be a mismatch between the old project id and the new. There is an automated system that updates project ids, but it is possible that it hasn't run yet or is having a problem (I'm on vacation right now, so can't check). Hopefully, if you retry again some time soon it will just work. If not, let us know.
Upvotes: 1
Reputation: 6625
There are a few questions here:
Why did my load job fail? Just to check, was that the entire request you sent? If so, it looks like there's no data to be loaded, i.e. sourceUris is empty. If that's the case, that's the problem, and we're apparently returning the world's worst error message.
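For illustration, a load configuration that reads from Google Cloud Storage (rather than inlining the data in the multipart body) would carry a sourceUris entry; a minimal sketch, with a made-up gs:// path and a trimmed schema:
{
  "configuration": {
    "load": {
      "sourceUris": ["gs://example-bucket/telephony_logs.csv"],
      "schema": {
        "fields": [
          {"name": "date", "type": "STRING"},
          {"name": "message", "type": "STRING"}
        ]
      },
      "destinationTable": {
        "projectId": "firespotter.com:firespotter",
        "datasetId": "telephony_logs",
        "tableId": "table_name"
      }
    }
  }
}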
Why the numeric project ID? BigQuery uses the project name and the associated numeric ID interchangeably, so all you're seeing is that we tend to convert project names to IDs on the way in. Just to confirm, if you visit the Google APIs Console and look up your project, do you see that same numeric ID in the url?
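For example, using the values from this question, these two dataset references should point at the same dataset; the error message above simply reports it in the numeric form, "385479794093:telephony_logs":
{ "datasetReference": { "projectId": "firespotter.com:firespotter", "datasetId": "telephony_logs" } }
{ "datasetReference": { "projectId": "385479794093", "datasetId": "telephony_logs" } }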
Why does the project ID get specified in multiple places? First, it seems that you specified the project ID as a top-level attribute in the job; that shouldn't be necessary. (I suspect that it simply overrides whatever project ID you specify in the job reference itself.) That leaves two locations -- once as part of the job reference and the other as part of the table reference. These actually signify two different things -- the one in the job specifies what project you're inserting a job into, i.e. who's paying for the job, and the one in the table specifies what project the resulting table lives in, i.e. who owns the resulting data. In general, these will be the same, but the API allows them to be distinct. (This could be useful if, for example, you built a service that needed to insert data into tables ultimately owned by customers.)
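To make the two locations concrete, here is a sketch of a job body where they differ (both project IDs below are made-up placeholders):
{
  "jobReference": {
    "projectId": "project-that-is-billed-for-the-job"
  },
  "configuration": {
    "load": {
      "destinationTable": {
        "projectId": "project-that-owns-the-resulting-table",
        "datasetId": "some_dataset",
        "tableId": "some_table"
      }
    }
  }
}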
Upvotes: 0