p7k
p7k

Reputation: 161

Creating Impala external table from a partitioned file structure

Provided a partitioned fs structure like the following:

logs
└── log_type
    └── 2013
        ├── 07
        │   ├── 28
        │   │   ├── host1
        │   │   │   └── log_file_1.csv
        │   │   └── host2
        │   │       ├── log_file_1.csv
        │   │       └── log_file_2.csv
        │   └── 29
        │       ├── host1
        │       │   └── log_file_1.csv
        │       └── host2
        │           └── log_file_1.csv
        └── 08

I've been trying to create an external table in Impala:

create external table log_type (
    field1    string,
    field2    string,
    ...
)
row format delimited fields terminated by '|' location '/logs/log_type/2013/08';

I wish Impala would recurse into the subdirs and load all the csv files; but no cigar. No errors are thrown but no data is loaded into the table.

Different globs like /logs/log_type/2013/08/*/* or /logs/log_type/2013/08/*/*/* did not work either.

Is there a way to do this? Or should I restructure the fs - any advice on that?

Upvotes: 4

Views: 7871

Answers (3)

With the newer versions of impala you can use the

ALTER TABLE name RECOVER PARTITIONS

command. More info

What you have to be careful about is that, the partitioning fields has to be lowercase as the the directory structure is case sensitive but the impala queries are not.

Upvotes: 0

Joey
Joey

Reputation: 1349

in case you are still searching for an answer. You need to register each individual partition manually.

See here for details Registering External Table

Your schema for the table needs to be adjusted

create external table log_type (
        field1    string,
        field2    string,
...)
  partitioned by (year int, month int, day int, host string)
  row format delimited fields terminated by '|';

After you changed your schema, to include year, month, day and host, you recursively have to add each partition to the table.

Something like this

ALTER TABLE log_type ADD PARTITION (year=2013, month=07, day=28, host="host1")
    LOCATION '/logs/log_type/2013/07/28/host1';

Afterwards you need to refresh the table in impala.

invalidate log_type;
refresh log_type;

Upvotes: 9

pjames
pjames

Reputation: 302

Another way to do this might be to use the LOAD DATA function in Impala. If your data is in a SequenceFile or other less Impala-friendly format (Impala file formats), you can create your external table like Joey does above but instead of ALTER TABLE, you can do something like

LOAD DATA INPATH '/logs/log_type/2013/07/28/host1/log_file_1.csv' INTO TABLE log_type PARTITION (year=2013, month=07, day=28, host=host1);

Upvotes: 0

Related Questions