Reputation: 2157
I have data as below:
id,name,description,category
1,sid,I am don,right?,production
I followed the steps provided in this link. The main problem is that when I insert the data into the temp table, part of the 3rd column's data gets split off and pushed into the 4th column: as soon as Hive sees the "," before the word "right", it cuts the field there and shifts the rest into the next column. This happens because the data arrives in CSV format, so I used "," as the field delimiter while creating the temp table, and now the columns are a mess. How can I resolve it?
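For context, the temp table was created roughly like this (column types here are only illustrative):
-- sketch of the temp table; column types are assumed
CREATE TABLE sampleTable (
  id string,
  name string,
  description string,
  category string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','   -- this ',' delimiter is what splits the description field
LINES TERMINATED BY '\n';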
After following the steps suggested by xenodevil, I ran the query below:
insert into perTable select * from sampleTable;
Here sampleTable is the temp table and perTable is the ORC table; the sampleTable data looks like this:
+-----------------+-------------------+--------------------------+-----------------------+--+
| sampletable.id  | sampletable.name  | sampletable.description  | sampletable.category  |
+-----------------+-------------------+--------------------------+-----------------------+--+
| 1               | sid               | I am don,right?          | production            |
+-----------------+-------------------+--------------------------+-----------------------+--+
But I am getting the error below:
ERROR : Status: Failed
ERROR : Vertex failed, vertexName=Map 1, vertexId=vertex_1560140822404_0022_1_00, diagnostics=[Task failed, taskId=task_1560140822404_0022_1_00_000000, diagnostics=[TaskAttempt 0 failed, info=[Error: Failure while running task:java.lang.RuntimeException: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row {"id":"1","name":"sid","description":"I am don,right?","category":"production"}
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:173)
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:139)
So what is the issue here?
Upvotes: 0
Views: 2559
Reputation: 4754
You can use this sample flow to load the data into a Hive table.
1) Since there is ambiguity related to the , delimiter (the description itself contains commas), update the file to use | as the delimiter so that it unambiguously identifies the actual fields. The file then looks like this; let's call it data.csv.
1|sid|I am don,right?|production
2) Create a table in Hive, specifying the correct column delimiter.
hive> CREATE TABLE t1_tmp
(
id string,
name string,
description string,
category string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
LINES TERMINATED BY '\n';
3) Load the data from the local file
hive> LOAD DATA LOCAL INPATH '/path/to/data.csv' INTO TABLE t1_tmp;
4) View the data
hive> select * from t1_tmp;
OK
t1_tmp.id t1_tmp.name t1_tmp.description t1_tmp.category
1 sid I am don,right? production
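5) If your final target is an ORC table (perTable in your question), you can then insert from the staging table into it. A sketch, assuming perTable has the same four string columns (its exact DDL is not shown in the question):
hive> CREATE TABLE perTable
(
id string,
name string,
description string,
category string
)
STORED AS ORC;

hive> INSERT INTO TABLE perTable SELECT * FROM t1_tmp;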
Upvotes: 0
Reputation: 604
How are you going to determine how many commas can appear in the 3rd column? If it can contain any text, then it can contain any number of commas. This must be controlled at the file generation level, i.e. when your data is being written, rather than when you are trying to read it at a later time in CSV format. While generating the CSV file, you may, for instance:
- enclose fields that can contain the delimiter in quotes, and have the reader honour the quoting
- escape the delimiter wherever it occurs inside a field value
- use a delimiter character that cannot appear in the data (for example |, as in the other answer)
These are some of the common practices which reliably resolve such issues.
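For example, if the file producer quotes fields that contain the delimiter, Hive's built-in OpenCSVSerde can read them directly. A sketch, assuming the producer writes quoted fields (the table name is just a placeholder):
-- data.csv written with quoted text fields, e.g.:
-- 1,sid,"I am don,right?",production
CREATE TABLE t1_csv (
  id string,
  name string,
  description string,
  category string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  'separatorChar' = ',',
  'quoteChar'     = '"'
)
STORED AS TEXTFILE;
-- note: OpenCSVSerde exposes all columns as strings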
An unreliable way, very specific to the data you have presented, and which will only cope with the extra commas in your 3rd column, is to use the following regex with RegexSerDe for your ROW FORMAT:
([0-9]*),([a-zA-Z ]*),([A-Za-z ,?]*),([A-Za-z ,]*)
To set this, you will need to modify the table's DDL to something like:
CREATE TABLE `your_database_name.your_table_name`(
`id` string,
`name` string,
`description` string,
`category` string
)
COMMENT 'Your Comment'
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
'input.regex'='([0-9]*),([a-zA-Z ]*),([A-Za-z ,?]*),([A-Za-z ,]*)'
)
STORED AS TEXTFILE
;
I have checked the regex here, but you will need to adjust the DDL syntax to suit your needs.
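A quick way to sanity-check how the groups split your sample row is Hive's built-in regexp_extract (the third argument is the capture group index):
SELECT regexp_extract(
         '1,sid,I am don,right?,production',
         '([0-9]*),([a-zA-Z ]*),([A-Za-z ,?]*),([A-Za-z ,]*)',
         3);
-- expected to return: I am don,right?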
Upvotes: 1