user2543622

Reputation: 6756

Hive: understanding table creation

I am taking a MOOC.

It told us to upload a few files from our PC to HDFS using the command below:

azure storage blob upload local_path container data/logs/2008-01.txt.gz

I did the same. Later on, when I typed the command below in a PuTTY SSH session, I was able to see the files:

hdfs dfs -ls /data/logs
Found 6 items
-rwxrwxrwx   1     331941 2016-03-03 15:56 /data/logs/2008-01.txt.gz
-rwxrwxrwx   1     331941 2016-03-03 15:58 /data/logs/2008-02.txt.gz
-rwxrwxrwx   1     331941 2016-03-03 15:58 /data/logs/2008-03.txt.gz
-rwxrwxrwx   1     331941 2016-03-03 15:58 /data/logs/2008-04.txt.gz
-rwxrwxrwx   1     331941 2016-03-03 15:58 /data/logs/2008-05.txt.gz
-rwxrwxrwx   1     331941 2016-03-03 15:58 /data/logs/2008-06.txt.gz

Then we started a Hive shell, first created a table, and then loaded data into that table using:

load data inpath '/data/logs' into TABLE rawlog;
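(The exact CREATE TABLE statement for rawlog isn't shown here; as a rough sketch, it was a plain delimited staging table along these lines, assuming the same columns as the cleanlog table below but with log_date as STRING so the '#'-prefixed header lines load intact:)

CREATE TABLE rawlog
(log_date STRING,
log_time STRING,
c_ip STRING,
cs_username STRING,
s_ip STRING,
s_port STRING,
cs_method STRING,
cs_uri_stem STRING,
cs_uri_query STRING,
sc_status STRING,
sc_bytes INT,
cs_bytes INT,
time_taken INT,
cs_user_agent STRING,
cs_referrer STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ';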

Then we created an external table using the command below:

CREATE EXTERNAL TABLE cleanlog
(log_date DATE,
log_time STRING,
c_ip STRING,
cs_username STRING,
s_ip STRING,
s_port STRING,
cs_method STRING,
cs_uri_stem STRING,
cs_uri_query STRING,
sc_status STRING,
sc_bytes INT,
cs_bytes INT,
time_taken INT,
cs_user_agent STRING,
cs_referrer STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
STORED AS TEXTFILE LOCATION '/data/cleanlog';

We inserted data into the table using:

INSERT INTO TABLE cleanlog
SELECT *
FROM rawlog
WHERE SUBSTR(log_date, 1, 1) <> '#';
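(As a quick sanity check, a sketch like the following previews the header rows that this filter excludes:)

SELECT * FROM rawlog
WHERE SUBSTR(log_date, 1, 1) = '#'
LIMIT 5;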

I exited Hive and typed the command below:

hdfs dfs -ls /data/logs
  1. I don't see anything in that folder. Why? Where did the uploaded log files go?
  2. Where is the rawlog table? Does it exist in the same folder? Why don't I see it?
  3. Why do I see the file 000000_0 in my cleanlog folder? Is it the new table? If I type the command

    hdfs dfs -ls /data/cleanlog

the output that I get is:

Found 1 items
-rwxr-xr-x   1 sshuser supergroup   71323206 2016-03-03 16:11 /data/cleanlog/000000_0
Update 1:
  1. What would happen if I loaded one more data file into /data/logs/ and then ran select * from rawlog? Would it automatically pull data from the new file?

Upvotes: 2

Views: 258

Answers (2)

Ravindra babu

Reputation: 38910

If you don't want to lose the data in the source folder, use an external table. Have a look at this SE question:

Difference between `load data inpath ` and `location` in hive?

  1. I don't see anything in that folder. Why? Where did the uploaded log files go?

They have been moved, because the data was loaded into the table and you used LOAD DATA INPATH instead of an external table.
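If you had instead created an external table pointing at /data/logs, the files would have stayed where you uploaded them. A minimal sketch (the table name rawlog_ext is hypothetical; each row is kept as a single raw string for simplicity):

-- an external table only records metadata; dropping it leaves /data/logs untouched
CREATE EXTERNAL TABLE rawlog_ext (line STRING)
STORED AS TEXTFILE LOCATION '/data/logs';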

  2. Where is the rawlog table? Does it exist in the same folder? Why don't I see it?

The table definition does not exist in the folder where the data resides. In your CREATE TABLE statement, you already specified /data/cleanlog as the location where the table data is stored.

Have a look at the questions below on where Hive stores files in HDFS:

Where does Hive store files in HDFS?

I have created a table in hive, I would like to know which directory my table is created in?

  3. Why do I see the file 000000_0 in my cleanlog folder? Is it the new table?

It's not a new table. Execute this command in the Hive shell:

describe formatted <table_name>;
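In its output, the Location row gives the table's HDFS directory, and the 000000_0 file there is simply a data file written by your INSERT (one per final writer task), not a separate table. You can list it without leaving Hive (a sketch):

-- list the cleanlog table's directory from inside the Hive shell
dfs -ls /data/cleanlog;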

EDIT: Regarding incremental updates to the table, follow the steps in this article and this question: Delta/Incremental Load in Hive
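On your update 1: Hive reads every file in a table's directory at query time, so a file placed directly in the managed table's folder shows up in the next SELECT, whereas files added back to /data/logs are invisible to rawlog because that folder is no longer where the table's data lives. A sketch, assuming the default warehouse path (verify it with DESCRIBE FORMATTED rawlog; the file name is hypothetical):

-- copy a new month's file straight into the managed table's folder
dfs -put 2008-07.txt.gz /hive/warehouse/rawlog/;
-- the next query scans all files in the folder, including the new one
SELECT COUNT(*) FROM rawlog;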

Upvotes: 2

Graeme Malcolm

Reputation: 11

You used the LOAD command, which MOVED the files from their original location to the folder for the rawlog table (which by default will be /hive/warehouse/rawlog).
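You can confirm this from the Hive shell (a sketch; the warehouse path may differ on your cluster):

-- the .gz files moved by LOAD DATA should now be listed here
dfs -ls /hive/warehouse/rawlog;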

Upvotes: 1
