Reputation: 11
I am currently trying to insert all our DT files v2 into BQ. I already did it with the click file, I spotted any trouble.
But it's not the same game with the activity and impression.
I wrote a quick script to help me in making the schema for the insertion :
import csv,json
import glob
data = []
for i in glob.glob('*.csv'):
print i
b = i.split("_")
print b[2]
with open(i, 'rb') as f:
reader = csv.reader(f)
row1 = next(reader)
title = [w.replace(' ', '_').replace('/', '_').replace(':', '_').replace('(', '_').replace(')', '').replace("-", "_") for w in row1]
print title
for a in title:
j={"name":"{0}".format(a),"type":"string","mode":"nullable"}
print j
if j not in data:
data.append(j)
with open('schema_' + b[2] + '.json', 'w') as outfile:
json.dump(data, outfile)
After that, I use the small bash script to insert all our data from our GCS .
#!/bin/bash
prep_files() {
date=$(echo "$f" | cut -d'_' -f4 | cut -c1-8)
echo "$n"
table_name=$(echo "$f" | cut -d'_' -f1-3)
bq --nosync load --field_delimiter=',' DCM_V2."$table_name""_""$date" "$var" ./schema/v2/schema_"$n".json
}
num=1
for var in $(gsutil ls gs://import-log/01_v2/*.csv.gz)
do
if test $num -lt 10000
then
echo "$var"
f=$(echo "$var" | cut -d'/' -f5)
n=$(echo "$f" | cut -d'_' -f3)
echo "$n"
prep_files
num=$(($num+1))
else
echo -e "Wait the next day"
echo "$num"
sleep $(( $(date -d 'tomorrow 0100' +%s) - $(date +%s) ))
num=0
fi
done
echo 'Import done'
But I have this kind of error :
Errors: Too many errors encountered. (error code: invalid) /gzip/subrange//bigstore/import-log/01_v2/dcm_accountXXX_impression_2016101220_20161013_073847_299112066.csv.gz: CSV table references column position 101, but line starting at position:0 contains only 101 columns. (error code: invalid)
So I check the number of columns in my schema with :
$awk -F',' '{print NF}'
But I have the good number of column...
So I thought that was because we had comma in value (some publishers are using a .NET framework, that allows comma in url). But theses values are enclosed with double quote.
So I made a test with a small file :
id,url
1,http://www.google.com
2,"http://www.google.com/test1,test2,test3"
And this loading works...
If someone have a clue to help me, that could be realy great. :)
EDIT : I did another test by make the load with an already decompressed file.
Too many errors encountered. (error code: invalid) file-00000000: CSV table references column position 104, but line starting at position:2006877004 contains only 104 columns. (error code: invalid)
I used this command to find the line : $tail -c 2006877004 dcm_accountXXXX_activity_20161012_20161013_040343_299059260.csv | head -n 1
I get :
3079,10435077,311776195,75045433,1,2626849,139520233,IT,,28,0.0,22,,4003208,,dc_pre=CLHihcPW1M8CFTEC0woddTEPSQ;~oref=http://imasdk.googleapis.com/js/core/bridge3.146.2_en.html,1979747802,1476255005253094,2,,4233079,CONVERSION,POSTVIEW,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1,0.000000000,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
After that : $head -n1 dcm_account8897_activity_20161012_20161013_040343_299059260.csv | awk -F',' '{print NF}'
Response : 102
So, I have 104 columns in the first row and 102 on this one...
Anyone else have trouble with the DT files v2 ?
Upvotes: 1
Views: 585
Reputation: 11
I had this similar issue and found the problem was due to a few records being separated by carriage returns into 2 lines. Removing \r solved the problem
The line affected is usually not the line reflected in the error log.
Upvotes: 1
Reputation: 195
I would open the csv file from google sheets, and compare the columns with the schema you generated. Most probably you will found a bug in the schema.
Upvotes: 0