Reputation: 135

how to merge data while loading them into hive?

I'm tring to use hive to analysis our log, and I have a question.

Assume we have some data like this:
A 1
A 1
A 1
B 1
C 1
B 1
How can I make it like this in hive table(order is not important, I just want to merge them) ?
A 1
B 1
C 1

without pre-process it with awk/sed or something like that?
Thanks!

Upvotes: 1

Answers (4)

brandon.bell

Reputation: 1411

There's no way to pre-process the data while it's being loaded without using an external program. You could use a view if you'd like to keep the original data intact.

hive> SELECT * FROM table1;
OK
A       1
A       1
A       1
B       1
C       1
B       1
B       2 # Added to show it will group correctly with different values

hive> CREATE VIEW table2 (fld1, fld2) AS SELECT fld1, fld2 FROM table1 GROUP BY fld1, fld2;
hive> SELECT * FROM table2;
OK
A       1
B       1
B       2
C       1

Upvotes: 0

Kapil Sharma

Reputation: 41

For merging the data, we can also use "UNION ALL" , it can also merge two different types of datatypes.

insert overwrite into table test1 (select x.* from t1 x ) UNION ALL (select y.* from t2 y);

here we are merging two tables data (t1 and t2) into one single table test1.

Upvotes: 0

Navneet Kumar

Reputation: 3752

Step 1: Create a Hive table for input data set . create table if not exists table1 (fld1 string, fld2 string ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'; (i assumed field seprator is \t, you can replace it with actual separator)

Step 2 : Run below to get the merge data you are looking for

create table table2 as select fld1,fld2 from table1 group by fld1,fld2 ;

I tried this for below input set

hive (default)> select * from table1; OK

A 1

B 1

C 1

B 1

create table table4 as select fld1,fld2 from table1 group by fld1,fld2 ;

hive (default)> select * from table4; OK

A 1

B 1

C 1

You can use external table as well , but for simplicity I have used managed table here.

Upvotes: 3

user783112

Reputation: 21

One idea.. you could create a table around the first file (called 'oldtable').

Then run something like this.... create table newtable select field1, max(field) from oldtable group by field1;

Not sure I have the syntax right, but the idea is to get unique values of the first field, and only one of the second. Make sense?

Upvotes: 2

how to merge data while loading them into hive?

Answers (4)

Related Questions