Reputation: 37
We just noticed that around 09/27/2012 our data was duplicated during CSV file uploads (using the Java API). The logs showed no errors during the upload, but we have confirmed that a majority of the rows loaded that day were duplicated (each row carries a distinct microsecond timestamp). Were there any known glitches that day? We're at a loss as to how to prevent this from happening again.
Thanks for any feedback.
Upvotes: 3
Views: 520
Reputation: 26
Thanks for looking into this for us. It is hard (almost impossible) to believe that the data was duplicated on the BigQuery side; that said, nothing we can see indicates otherwise. As mentioned, we have a microsecond timestamp value on every row. For the two job IDs referenced, I picked a row at random and verified that its timestamp is unique within all of the data we've ever imported. When I run the same query against our BigQuery table, I get two (identical) rows.
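For reference, the check amounts to a lookup along these lines via the Java API (the dataset, table, column name, and timestamp value are placeholders; bigquery is assumed to be an already-authorized client and projectId our project):

import com.google.api.services.bigquery.Bigquery;
import com.google.api.services.bigquery.model.QueryRequest;
import com.google.api.services.bigquery.model.QueryResponse;

// Look up one microsecond timestamp that should be unique across every import.
// Getting two identical rows back is the symptom described above.
QueryRequest request = new QueryRequest()
    .setQuery("SELECT * FROM [my_dataset.my_table] WHERE event_ts = 1348722000123456");
QueryResponse response = bigquery.jobs().query(projectId, request).execute();
System.out.println("Rows returned: " + response.getTotalRows());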
Upvotes: 1
Reputation: 26637
We don't know of any reason why data would be duplicated during import. If you provide us with more information, such as your job ID and project ID, that would be helpful in diagnosing the issue.
As Michael mentioned in his answer, people who see duplicated data have generally run the same job twice. (Note that if a job fails, the table should not be modified in any way.)
A way to prevent these kinds of collisions is to name your job, since we enforce job ID uniqueness at the project level. For example, if you do a load once a day, you might name your job ID something like "job_2012_10_08_load1". That way, if you tried to run the same job twice, the second one would fail on start.
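Since the load is done through the Java API, here is a rough sketch of setting an explicit job ID on the insert; the dataset, table, and source URI are placeholders, and constructing the authorized Bigquery client is assumed to happen elsewhere:

import java.io.IOException;
import java.util.Collections;

import com.google.api.services.bigquery.Bigquery;
import com.google.api.services.bigquery.model.Job;
import com.google.api.services.bigquery.model.JobConfiguration;
import com.google.api.services.bigquery.model.JobConfigurationLoad;
import com.google.api.services.bigquery.model.JobReference;
import com.google.api.services.bigquery.model.TableReference;

public class DailyLoad {
  // Insert a load job with an explicit, date-based job ID. If the same ID is
  // submitted twice in the same project, the second insert fails at start
  // instead of silently loading the data a second time.
  static Job insertLoadJob(Bigquery bigquery, String projectId) throws IOException {
    Job job = new Job()
        .setJobReference(new JobReference()
            .setProjectId(projectId)
            .setJobId("job_2012_10_08_load1"));   // must be unique per project

    JobConfigurationLoad load = new JobConfigurationLoad()
        .setSourceUris(Collections.singletonList("gs://my-bucket/export_2012_10_08.csv"))
        .setDestinationTable(new TableReference()
            .setProjectId(projectId)
            .setDatasetId("my_dataset")
            .setTableId("my_table"));
    job.setConfiguration(new JobConfiguration().setLoad(load));

    return bigquery.jobs().insert(projectId, job).execute();
  }
}

With the job ID fixed per day, a second submission of the same load would fail fast on insert rather than loading the data again.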
Upvotes: 0
Reputation: 7887
First: make sure (by checking the load job history) that you didn't actually end up running a load job twice. If you are using the bq command line client:
# Show all jobs for your selected project
bq ls -j
# Will result in a list such as:
...
job_d8fc9d7eefb2e9243b1ffde484b3ab8a load FAILURE 29 Sep 00:35:26 0:00:00
job_4704a91875d9e0c64f7aaa8de0458696 load SUCCESS 29 Sep 00:28:45 0:00:05
...
# Find the load jobs pertaining to the time of data loading. To show detailed information
# about which files you ingested in the load job, run a command on the individual jobs
# that might have been repeats:
bq --format prettyjson show -j job_d8fc9d7eefb2e9243b1ffde484b3ab8a
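If you are working from the Java API instead of the bq tool, the same job details can be fetched programmatically; a minimal sketch, assuming bigquery is an already-authorized client, projectId is your project, and the job ID is one of yours:

import com.google.api.services.bigquery.Bigquery;
import com.google.api.services.bigquery.model.Job;

// Fetch one job and print its configuration and status, the rough equivalent
// of running "bq --format prettyjson show -j" on that job ID.
Job job = bigquery.jobs().get(projectId, "job_d8fc9d7eefb2e9243b1ffde484b3ab8a").execute();
System.out.println(job.toPrettyString());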
Upvotes: 1