Reputation: 25
I am unable to create unique partitions. when i am uploading data, it's creating all the dates as partition again and again, even the dates are same
create table product_order1(id int,user_id int,amount int,product string, city string, txn_date string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
OK Time taken: 0.133 seconds
LOAD DATA LOCAL INPATH 'txn' INTO TABLE product_order1;
Loading data to table oct19.product_order1
Table oct19.product_order1 stats: [numFiles=1, totalSize=303]
OK
Time taken: 0.426 seconds
hive>
> set hive.exec.dynamic.partition = true;
hive>
> set hive.exec.dynamic.partition.mode = true;
hive>
> create table dyn_part(id int,user_id int,amount int,product string,city string) PARTITIONED BY(txn_date string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
OK
Time taken: 0.14 seconds
hive >
INSERT OVERWRITE TABLE dyn_part PARTITION(txn_date) select id,user_id,amount,product,city,txn_date from product_order1;
Result which i have received :-
Loading data to table oct19.dyn_part partition (txn_date=null)
Time taken for load dynamic partitions : 944
Loading partition {txn_date=04-02-2015}
Loading partition {txn_date= 03-04-2015}
Loading partition {txn_date=01-02-2015}
Loading partition {txn_date=03-04-2015}
Loading partition {txn_date= 01-01-2015}
Loading partition {txn_date=01-01-2015}
Loading partition {txn_date= 01-02-2015}
Time taken for adding to write entity : 5
Partition oct19.dyn_part{txn_date= 01-01-2015} stats: [numFiles=1, numRows=1, totalSize=25, rawDataSize=24]
Partition oct19.dyn_part{txn_date= 01-02-2015} stats: [numFiles=1, numRows=1, totalSize=25, rawDataSize=24]
Partition oct19.dyn_part{txn_date= 03-04-2015} stats: [numFiles=1, numRows=2, totalSize=50, rawDataSize=48]
Partition oct19.dyn_part{txn_date=01-01-2015} stats: [numFiles=1, numRows=1, totalSize=26, rawDataSize=25]
Partition oct19.dyn_part{txn_date=01-02-2015} stats: [numFiles=1, numRows=1, totalSize=26, rawDataSize=25]
Partition oct19.dyn_part{txn_date=03-04-2015} stats: [numFiles=1, numRows=1, totalSize=26, rawDataSize=25]
Partition oct19.dyn_part{txn_date=04-02-2015} stats: [numFiles=1, numRows=1, totalSize=25, rawDataSize=24]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 4.03 sec HDFS Read: 4166 HDFS Write: 614 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 30 msec
Upvotes: 1
Views: 57
Reputation: 38325
I noticed that some dates contain spaces, some are without spaces:
txn_date= 03-04-2015
and txn_date=03-04-2015
Try to add trim
:
INSERT OVERWRITE TABLE dyn_part PARTITION(txn_date)
select id, user_id, amount, product, city, trim(txn_date) as txn_date
from product_order1;
Also better use Hive compatible date format yyyy-MM-dd
, it is sortable.
To format date and remove spaces at the same time you can use regexp_replace.
If your current format is MM-dd-yyyy
, then you can format it like this:
select regexp_replace(' 03-04-2015','.*?(\\d{2})-(\\d{2})-(\\d{4})','$3-$1-$2') --fix accordingly if it is dd-MM-yyyy. In this case it should be '$3-$2-$1' in the replacement template.
Returns:
2015-03-04
Or load like this:
INSERT OVERWRITE TABLE dyn_part PARTITION(txn_date)
select id, user_id, amount, product, city,
regexp_replace(txn_date,'.*?(\\d{2})-(\\d{2})-(\\d{4})','$3-$1-$2') as txn_date
from product_order1;
regexp means:
'.*?
- any character zero or more times
(\\d{2})
- first group of 2 digits which will be addressed in the replacement as $1
-
dash literally
(\\d{2})
- second group of 2 digits which will be addressed in the replacement as $2
-
dash literally
(\\d{4})
- third group of 4 digits which will be addressed in the replacement as $3
And as a replacement '$3-$1-$2'
we take groups from regexp in proper order, delimited with dash. Supposed $3 is year, $1 is month, $2 is day in your dates. You place groups in proper order to get yyyy-MM-dd
because it is not possible to understand which format are you using: MM-dd-yyyy
or dd-MM-yyyy
Upvotes: 1