Reputation: 25
I have to write an application in C# that reads billions of records from files and then writes them to an Impala table through the Impala ODBC driver. I have already tried executing an insert query as a single statement with parameters
INSERT INTO table VALUES (?,?,.....,?)
or a single multi-row insert:
INSERT INTO table VALUES (?,?,.....,?),(?,?,.....,?),...,(?,?,.....,?)
But the first is very slow and creates one file on HDFS for each record; the second is faster, but the statement becomes very long, and for billions of records I receive the following error:
[Cloudera][SQLEngine] (31580) The length of the statement exceeds the maximum: 16384.
Does anyone have a solution to my problem, considering that I must use C# as the language for my application?
Thanks
Upvotes: 0
Views: 440
Reputation: 3634
I think you need a different approach here. That is, don't read the CSV through C# just to send its values to the server one by one; instead, issue commands that tell the server to read the file for you.
To begin, create a table matching the CSV file's structure in your database (you decide whether this needs to be done programmatically or through a tool). Then read the CSV into the new table with the LOAD DATA statement, and finally use an INSERT INTO ... SELECT statement to transform the newly loaded data into your destination table.
Pseudo code example:
CREATE TABLE DataHeap (whatever the structure of your CSV is);
LOAD DATA INPATH 'HDFS-PATH-TO-CSV-FILE' INTO TABLE DataHeap;
INSERT INTO YOUR-DESTINATION-TABLE SELECT whatever FROM DataHeap WHERE ...;
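Since you have to drive this from C#, you can run those same statements through the ODBC driver with plain `ExecuteNonQuery` calls. This is only a sketch: the DSN name, HDFS path, table names, and column definitions below are placeholders, and it assumes the CSV file has already been uploaded to HDFS (e.g. with `hdfs dfs -put`).

```csharp
using System;
using System.Data.Odbc;

class BulkLoad
{
    static void Main()
    {
        // "DSN=Impala" is a placeholder; use your configured Impala ODBC DSN
        // or a full connection string.
        using (var conn = new OdbcConnection("DSN=Impala"))
        {
            conn.Open();

            // Same three steps as the pseudocode: staging table, server-side
            // load of the CSV, then transform into the destination table.
            string[] statements =
            {
                "CREATE TABLE IF NOT EXISTS DataHeap (col1 STRING, col2 INT)",
                "LOAD DATA INPATH '/user/me/data.csv' INTO TABLE DataHeap",
                "INSERT INTO destination_table SELECT col1, col2 FROM DataHeap"
            };

            foreach (string sql in statements)
            {
                using (var cmd = new OdbcCommand(sql, conn))
                {
                    cmd.ExecuteNonQuery();
                }
            }
        }
    }
}
```

This way the data never flows through the C# process at all; the application only issues three short statements, so the 16384-character limit is no longer a concern.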
Upvotes: 0