Reputation: 33
I'm trying to load 2 CSV files from an HDFS folder. This is my table DDL:
CREATE TABLE Click_data (A timestamp, B string, C int, D bigint, E string, F string, G decimal(10,2), H bigint, I string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS TEXTFILE
LOCATION '/user/data/'
TBLPROPERTIES ("skip.header.line.count"="1");
I have 2 datasets in CSV format whose headers need to be skipped. How do I load both files into the same table from the HDFS directory?
Upvotes: 1
Views: 836
Reputation: 38290
You can have many files in the table location, and the header will be skipped in all of them. A table location is usually named after the table, e.g. /user/data/click_data. Each table should have its own location; /user/data/ does not seem like a correct location, it looks like a common folder containing other tables' locations. Carefully check that the location belongs to your table only.
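For example, you can check what the table actually points to and what already sits in that folder (a sketch, using the table name from the question; the Location: field in the output shows the table's directory):

```
-- In Hive: show the table's storage details, including its location
DESC FORMATTED Click_data;

-- From the shell: list the contents of that directory
-- hdfs dfs -ls /user/data/
```

If the listing shows subdirectories or files belonging to other tables, move your CSV files into a dedicated folder instead.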
Copy both files into the table location '/user/data/' using hadoop fs -cp, hdfs dfs -cp, or (on S3) the AWS CLI command aws s3 cp:
hadoop fs -cp 'source/path/to/files/*' '/user/data/Click_data'
Or you can use the Hive LOAD DATA command for the same:
hive> LOAD DATA INPATH 'source/path/to/files/file1.csv' INTO TABLE Click_data;
hive> LOAD DATA INPATH 'source/path/to/files/file2.csv' INTO TABLE Click_data;
In both cases the data will be selectable, and the header will be skipped in both files.
Alternatively, you can create the table on top of the location where the files already are (provided there are no other files or folders there that do not belong to your table): just specify the correct location and use CREATE EXTERNAL TABLE .... You can also create the table without specifying a location, check the assigned location using the DESC FORMATTED command, and then copy the files into that location.
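A minimal sketch of the external-table variant, assuming the files are placed in a dedicated folder /user/data/click_data (a hypothetical path, adjust to where your files actually are):

```
-- EXTERNAL: dropping the table leaves the files in place
CREATE EXTERNAL TABLE Click_data (A timestamp, B string, C int, D bigint, E string, F string, G decimal(10,2), H bigint, I string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS TEXTFILE
LOCATION '/user/data/click_data'
TBLPROPERTIES ("skip.header.line.count"="1");
```

With EXTERNAL, the table is created directly on top of the existing files, so no copy or LOAD step is needed.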
After loading files into the table location, it is better to refresh the table statistics using the ANALYZE command; this allows the CBO (cost-based optimizer) to work correctly. Read this: https://stackoverflow.com/a/41021682/2700344
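For example (the first statement gathers basic table-level statistics; the second, column-level statistics, is optional but helps the optimizer further):

```
ANALYZE TABLE Click_data COMPUTE STATISTICS;
ANALYZE TABLE Click_data COMPUTE STATISTICS FOR COLUMNS;
```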
Upvotes: 1