Naveen Reddy Marthala
Naveen Reddy Marthala

Reputation: 3123

Hive - create hive table from specific data of three csv files in hdfs

I have three .csv files, each in different hdfs directory. I now want to make a Hive internal table with data from those three files. I want four columns from first file, three columns from second file and two columns from third file. first file share an unique id column with second file and third file share another unique id column with third file. both unique ids are present in second file; using these ids I would like to left-outer-join to make table.

file 1: '/directory_1/sub_directory_1/table1_data_on_01_01_2014.csv'
file 2: '/directory_2/sub_directory_2/table2_data_on_01_01_2014.csv'
file 3: '/directory_3/sub_directory_3/table3_data_on_01_01_2014.csv'

contents of file 1:

unique_id_1,age,department,reason_of_visit,--more columns--,,,
id_11,entry_12,entry_13,entry_14,--more entries--
id_12,entry_22,entry_23,entry_24,--more entries--
id_13,entry_32,entry_33,entry_34,--more entries--

contents of file 2:

unique_id_1,date_of_transaction,transaction_fee,unique_id_2--more columns--,,,
id_11,entry_121,entry_131,id_21,--more entries--
id_12,entry_221,entry_231,id_22,--more entries--
id_13,entry_321,entry_331,id_23,--more entries--

contents of file 3:

unique_id_2,diagnosis,gender --more columns--,,,
id_21,entry_141,entry_151,--more entries--
id_22,entry_241,entry_151,--more entries--
id_23,entry_341,entry_151,--more entries--

I now want to make an internal table like this:

unique_id_1 age department reason_of_visit date_of_transaction unique_id_2 transaction_fee diagnosis gender
id_11 entry_12 entry_13 entry_14 entry_121 entry_131 id_21 entry_141 entry_151
id_12 entry_22 entry_23 entry_24 entry_221 entry_231 id_22 entry_241 entry_251
id_13 entry_32 entry_33 entry_34 entry_321 entry_331 id_23 entry_341 entry_251

How do i accomplish this?

Upvotes: 0

Views: 264

Answers (1)

steven-matison
steven-matison

Reputation: 1659

@Naveen Kumar The solution here is to create external tables for your 3 sources. Next create combined internal table with the schema for columns you need from the 3 sources. I call these temp or staging tables. Once these staging tables are created, you should be able to do your joined select as an INSERT INTO combined_table SELECT ...

Upvotes: 2

Related Questions