Reputation: 856
I am currently working with the new Google PHP client library for BigQuery. It works well overall, but I am running into performance issues that I haven't been able to find anything about.
I'm using the new insertRows method to send data to BigQuery, and although I always receive a "bigquery#tableDataInsertAllResponse" response, I hardly ever actually see my data get inserted.
The datasets I send are usually about 30-100 items at a time. I looked up the BigQuery quotas, and I am not exceeding them; I'm not even close to the limits.
BigQuery does eventually insert some of the rows I send, but it takes incredibly long before it acknowledges that I sent data, and when it does, the data sits in the streaming buffer for days and in the end only part of it gets inserted.
This is the function I use to insert data:
public function insertInto($tableName, $datasetName, $arrayToInsert) {
    if (empty($arrayToInsert)) {
        return true;
    }

    $table = $this->getTable($tableName, $datasetName);
    $bqTable = $this->bigQueryIdController->getBigQueryTable($tableName, $datasetName);

    // Build insertIds from the locally stored table size plus the row index.
    $arraySize = count($arrayToInsert);
    $currentSize = $bqTable->getSize();
    for ($i = 0; $i < $arraySize; ++$i) {
        $arrayToInsert[$i]['insertId'] = $currentSize + 1 + $i;
    }

    $insertResponse = $table->insertRows($arrayToInsert);
    if (!$insertResponse->isSuccessful()) {
        $this->handleFailedRows($insertResponse->failedRows());
        return false;
    } else {
        // Insert succeeded: add the number of "inserted" rows to the stored size.
        $this->bigQueryIdController->updateBigQueryTable($bqTable, $arraySize);
        return true;
    }
}
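For reference, the rows I pass look roughly like this (the column names below are just placeholders); the library expects each row as a 'data' array, and insertInto() adds the 'insertId' next to it:

$rows = [
    [
        'data' => [
            'some_field'    => 'value',   // placeholder column names
            'another_field' => 42,
        ],
        // insertInto() sets 'insertId' here before calling insertRows()
    ],
    // ... 30-100 of these per call
];
$this->insertInto('my_table', 'my_dataset', $rows);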
Is there any way I can speed things up? I can't really use BigQuery right now because of how long it takes before my data is actually available.
Edit:
I have already found that rows can be dropped when duplicate insertIds are used, but I don't think that can happen here. Whenever I insert into a table, I read the table size from a value stored in my own database, not from BigQuery. If it isn't present yet, I fetch the size remotely and store it locally. Whenever an insertion is successful, I add the number of "inserted" items to that count and update the stored size, so the value in my database reflects how many items should be present.
Upvotes: 0
Views: 292
Reputation: 207982
Your insertId
is not unique if you run this multiple times: later requests end up with rows carrying the same insertId, which wipes out the previous rows from the streaming buffer. This also has a snowball effect: you constantly have data in the streaming buffer that never makes it out to storage, because it keeps being wiped out by a new request reusing the same insertIds.
Make sure insertId
is something unique per request, such as a value based on microtime; it should be very different for every request. Only in the rare case where you stream the same row from, say, 5 locations and want to keep just 1 of them should the insertId be shared, so that BigQuery retains only 1 of the 5 (that was just an example).
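A minimal sketch of what that could look like inside your insertInto() function (assuming each element of $arrayToInsert is one row array, as in your code): derive the insertId from something unique per request instead of from the table size.

// Sketch: give every row an insertId that is unique across requests.
$arraySize = count($arrayToInsert);
for ($i = 0; $i < $arraySize; ++$i) {
    // uniqid() with extra entropy plus the row index; bin2hex(random_bytes(16))
    // would work just as well. Anything that repeats across requests defeats the purpose.
    $arrayToInsert[$i]['insertId'] = uniqid('', true) . '-' . $i;
}
$insertResponse = $table->insertRows($arrayToInsert);

If you retry a failed request, reuse the insertIds you generated for that batch, since deduplication only helps when the retry carries the same ids.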
Upvotes: 0