Reputation: 19
I'm trying to build a Beeline script to load a CSV file stored in HDFS into an external Hive table. First I create an empty external table:
CREATE EXTERNAL TABLE IF NOT EXISTS db.table
(column string)
STORED AS AVRO
LOCATION '/foo/bar';
After that I checked that the table was created, and it was. I had already loaded a CSV file into HDFS with:
hdfs dfs -put test.csv /temp/raw_csv
Now I want to load/insert this CSV data into the external Hive table, but stored in Avro format (as defined in the create script). I tried it with this line:
LOAD DATA INPATH '/temp/raw_csv/test.csv' OVERWRITE INTO TABLE db.table
This line runs without an error; see the console output here:
INFO : Compiling command(queryId=hive_20201201130808_4120504b-6799-4833-83e7-5fa8ff8c6ca8): LOAD DATA INPATH '/temp/raw_csv/test.csv' OVERWRITE INTO TABLE db.table
INFO : Semantic Analysis Completed
INFO : Returning Hive schema: Schema(fieldSchemas:null, properties:null)
INFO : Completed compiling command(queryId=hive_20201201130808_4120504b-6799-4833-83e7-5fa8ff8c6ca8); Time taken: 0.427 seconds
INFO : Executing command(queryId=hive_20201201130808_4120504b-6799-4833-83e7-5fa8ff8c6ca8): LOAD DATA INPATH '/temp/raw_csv/test.csv' OVERWRITE INTO TABLE db.table
INFO : Starting task [Stage-0:MOVE] in serial mode
INFO : Loading data to table db.table from hdfs://temp/raw_csv/test.csv
INFO : Starting task [Stage-1:STATS] in serial mode
INFO : Table db.table stats: [numFiles=1, totalSize=45981179]
INFO : Completed executing command(queryId=hive_20201201130808_4120504b-6799-4833-83e7-5fa8ff8c6ca8); Time taken: 0.376 seconds
INFO : OK
No rows affected (0.87 seconds)
But when I try to select from the table with the following line, I get an error:
select * FROM db.test;
INFO : Compiling command(queryId=hive_20201201131414_79187a87-c5e6-4b54-aecc-54c15d9a4bfa): select * FROM db.test
INFO : Semantic Analysis Completed
INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:column, type:string, comment:null)
INFO : Completed compiling command(queryId=hive_20201201131414_79187a87-c5e6-4b54-aecc-54c15d9a4bfa); Time taken: 0.243 seconds
INFO : Executing command(queryId=hive_20201201131414_79187a87-c5e6-4b54-aecc-54c15d9a4bfa): select * FROM db.test
INFO : Completed executing command(queryId=hive_20201201131414_79187a87-c5e6-4b54-aecc-54c15d9a4bfa); Time taken: 0.006 seconds
INFO : OK
Error: java.io.IOException: java.io.IOException: Not a data file. (state=,code=0)
Is it possible to load CSV data into a Hive Avro table with a Beeline command? And/or is there a better way to load this data? I normally load 1-100 million rows of data.
Upvotes: 1
Views: 2723
Reputation: 38335
It does not work that way. The LOAD command does not transform data; it just moves the file into the table location.
If you want to convert from CSV to Avro, then follow these steps:
1. Create a CSV table (STORED AS TEXTFILE) on top of a staging location.
2. Put the file directly into that location using hdfs dfs -put, or use
   LOAD DATA LOCAL INPATH 'local/path/to/csv' OVERWRITE INTO TABLE db.csv_table
3. Create the Avro table.
4. Use Hive to load the data from the CSV table into the Avro table:
   insert overwrite table avro_table select ... from csv_table
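Put together, the steps might look like this sketch (column names and locations mirror the question; the comma delimiter is an assumption about your CSV):

```sql
-- Steps 1+2: staging table in text format over the raw CSV directory
CREATE EXTERNAL TABLE IF NOT EXISTS db.csv_table (
  column string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/temp/raw_csv';

-- Step 3: target table stored as Avro
CREATE EXTERNAL TABLE IF NOT EXISTS db.avro_table (
  column string
)
STORED AS AVRO
LOCATION '/foo/bar';

-- Step 4: Hive reads the CSV rows and rewrites them as Avro files
INSERT OVERWRITE TABLE db.avro_table
SELECT * FROM db.csv_table;
```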
A SerDe is responsible for reading and writing data files; it is used when you create a table with a specific SerDe and then select from or insert into it. LOAD knows nothing about the file schema and does not use a SerDe. In the last step (4), Hive will read the CSV table and insert the same data into the Avro table, using the SerDes specified in both tables' DDL: one to de-serialize the rows from CSV and the other to serialize them to Avro format.
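To make the SerDe explicit: STORED AS AVRO is shorthand for naming the Avro SerDe and container file formats directly. A sketch of the equivalent expanded DDL (these are the standard Hive class names, but verify them against your Hive version):

```sql
CREATE EXTERNAL TABLE IF NOT EXISTS db.avro_table (
  column string
)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/foo/bar';
```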
Alternatively, you can convert the CSV file to an Avro file using the csv2avro tool (or some other tool) and then load it into the Avro table location.
If you have no need to convert from CSV to Avro, and only want to load the data into a table so you can select it, then all you need is to create a CSV table and put the data files into its location (steps 1 and 2).
Upvotes: 4