Reputation: 25
I have to write an application in C# that reads billions of records from files and then writes them to an Impala table through the Impala ODBC driver. I have already tried executing an insert query as a single statement with parameters
INSERT INTO table VALUES (?,?,.....,?)
or a single multi-row insert:
INSERT INTO table VALUES (?,?,.....,?),(?,?,.....,?),...,(?,?,.....,?)
But the first is very slow and creates one file on HDFS for each record; the second is faster, but the statement becomes very long, and for billions of records I receive the following error:
[Cloudera][SQLEngine] (31580) The length of the statement exceeds the maximum: 16384.
Does anyone have a solution to my problem, considering that I must use C# as the language for my application?
Thanks
Upvotes: 0
Views: 440
Reputation: 3634
I think you need a different approach here. That is, don't read the CSV through C# just to send its values to the server one by one; instead, issue commands that tell the server to read the file for you.
To begin, create a table matching the CSV file's structure in your database (you decide whether this needs to be done programmatically or through a tool). Then read the CSV into the new table with the LOAD DATA statement, and finally use an INSERT INTO ... SELECT statement to transform the newly loaded data into your destination table.
Pseudo code example:
CREATE TABLE DataHeap (whatever the structure of your CSV is);
LOAD DATA INPATH 'HDFS-PATH-TO-CSV-FILE' INTO TABLE DataHeap;
INSERT INTO YOUR-DESTINATION-TABLE SELECT whatever FROM DataHeap WHERE ...;
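Since you have to drive this from C#, you can run those same statements through the ODBC driver with plain `ExecuteNonQuery` calls. This is only a sketch: the DSN name, HDFS path, table names, and column definitions below are placeholders, and it assumes the CSV file has already been uploaded to HDFS (e.g. with `hdfs dfs -put`).

```csharp
using System;
using System.Data.Odbc;

class BulkLoad
{
    static void Main()
    {
        // "DSN=Impala" is a placeholder; use your configured Impala ODBC DSN
        // or a full connection string.
        using (var conn = new OdbcConnection("DSN=Impala"))
        {
            conn.Open();

            // Same three steps as the pseudocode: staging table, server-side
            // load of the CSV, then transform into the destination table.
            string[] statements =
            {
                "CREATE TABLE IF NOT EXISTS DataHeap (col1 STRING, col2 INT)",
                "LOAD DATA INPATH '/user/me/data.csv' INTO TABLE DataHeap",
                "INSERT INTO destination_table SELECT col1, col2 FROM DataHeap"
            };

            foreach (string sql in statements)
            {
                using (var cmd = new OdbcCommand(sql, conn))
                {
                    cmd.ExecuteNonQuery();
                }
            }
        }
    }
}
```

This way the data never flows through the C# process at all; the application only issues three short statements, so the 16384-character limit is no longer a concern.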
Upvotes: 0