Tim Edwards

Reputation: 1028

Using Externally created Parquet files in Impala

First off, apologies if this comes across as poorly worded; I've tried to help myself, but I'm not clear on where it's going wrong.

I'm trying to query data in Impala which has been exported from another system. Up till now it's been exported as a pipe-delimited text file, which I've been able to import fine by creating the table with the right delimiter set-up, copying in the file, and then running a refresh statement.
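For reference, my current set-up looks roughly like this (the table, column, and file names are made up):

CREATE TABLE my_text_table (id BIGINT, name STRING, comments STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
  STORED AS TEXTFILE;

-- after copying the exported file into the table's directory with
-- hdfs dfs -put export.txt /path/to/my_text_table/
REFRESH my_text_table;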

We've had some issues where some fields contain line-break characters, which makes it look like we have more records than we actually do, and the data then doesn't necessarily fit the metadata I've created.
The suggestion was made that we could use Parquet format instead, which would cope with the embedded line breaks fine.

I've received data and it looks a bit like this (I changed the username):

-rw-r--r--+ 1 UserName Domain Users  20M Jan 17 10:15 part-00000-6a763116-6728-4467-a641-32dd710857fe.snappy.parquet
-rw-r--r--+ 1 UserName Domain Users 156K Jan 17 10:15 .part-00000-6a763116-6728-4467-a641-32dd710857fe.snappy.parquet.crc
-rw-r--r--+ 1 UserName Domain Users  14M Jan 17 10:15 part-00001-6a763116-6728-4467-a641-32dd710857fe.snappy.parquet
-rw-r--r--+ 1 UserName Domain Users 110K Jan 17 10:15 .part-00001-6a763116-6728-4467-a641-32dd710857fe.snappy.parquet.crc
-rw-r--r--+ 1 UserName Domain Users    0 Jan 17 10:15 _SUCCESS
-rw-r--r--+ 1 UserName Domain Users    8 Jan 17 10:15 ._SUCCESS.crc

If I create a table stored as Parquet through Impala and then run hdfs dfs -ls on its directory, I get something like the following:

-rwxrwx--x+  3 hive hive       2103 2019-01-23 10:00 /filepath/testtable/594eb1cd032d99ad-5c13d29e00000000_1799839777_data.0.parq
drwxrwx--x+  - hive hive          0 2019-01-23 10:00 /filepath/testtable/_impala_insert_staging

That's obviously a bit different from what I've received...

How do I create the table in Impala so that it can accept what I've received? And do I just need the .parquet files in there, or do I also need to put the .parquet.crc files in?

Or is what I've received not fit for purpose?

I've tried looking at the Impala documentation for this, but I don't think it covers it.
Is it something that I need to do with a SerDe?
I tried specifying the compression_codec as snappy, but this gave the same results.

Any help would be appreciated.

Upvotes: 0

Views: 2874

Answers (1)

Zoltan

Reputation: 3105

The names of the files do not matter; as long as they are not special files (like _SUCCESS or .something.crc), Impala will read them as Parquet files. You don't need the .crc or _SUCCESS files.

You can use Parquet files from an external source in Impala in two ways:

  1. First create a Parquet table in Impala, then put the external files into the directory that corresponds to the table.

  2. Create a directory, put the external files into it and then create a so-called external table in Impala. (You can put more data files there later as well.)

After putting external files into a table's directory, you have to issue INVALIDATE METADATA table_name; (or, for a table Impala already knows about, the lighter-weight REFRESH table_name;) to make Impala pick up the new files.
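For example, for the first approach, once the table exists (see the CREATE TABLE syntax below), copying the files in and making them visible could look like this (the database name, table name, and warehouse path are placeholders, assuming the default Hive warehouse location):

hdfs dfs -put part-00000-6a763116-6728-4467-a641-32dd710857fe.snappy.parquet \
    /user/hive/warehouse/mydb.db/my_parquet_table/

Then, in impala-shell:

INVALIDATE METADATA mydb.my_parquet_table;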

The syntax for creating a regular Parquet table is

CREATE TABLE table_name (col_name data_type, ...)
  STORED AS PARQUET;

The syntax for creating an external Parquet table is

CREATE EXTERNAL TABLE table_name (col_name data_type, ...)
  STORED AS PARQUET LOCATION '/path/to/directory';
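A complete sketch of the second approach could look like this (the directory, table name, and columns are placeholders; the column names and types have to match the schema the Parquet files were written with):

hdfs dfs -mkdir -p /data/exported_parquet
hdfs dfs -put part-00000-6a763116-6728-4467-a641-32dd710857fe.snappy.parquet /data/exported_parquet/
hdfs dfs -put part-00001-6a763116-6728-4467-a641-32dd710857fe.snappy.parquet /data/exported_parquet/

CREATE EXTERNAL TABLE exported_data (id BIGINT, name STRING, comments STRING)
  STORED AS PARQUET LOCATION '/data/exported_parquet';

-- if more data files land in the directory later:
REFRESH exported_data;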

An excerpt from the Overview of Impala Tables section of the docs:

Physically, each table that uses HDFS storage is associated with a directory in HDFS. The table data consists of all the data files underneath that directory:

  • Internal tables are managed by Impala, and use directories inside the designated Impala work area.
  • External tables use arbitrary HDFS directories, where the data files are typically shared between different Hadoop components.

An excerpt from the CREATE TABLE Statement section of the docs:

By default, Impala creates an "internal" table, where Impala manages the underlying data files for the table, and physically deletes the data files when you drop the table. If you specify the EXTERNAL clause, Impala treats the table as an "external" table, where the data files are typically produced outside Impala and queried from their original locations in HDFS, and Impala leaves the data files in place when you drop the table. For details about internal and external tables, see Overview of Impala Tables.
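To illustrate the difference described in that excerpt, using the hypothetical tables from the sketches above:

DROP TABLE my_parquet_table;  -- internal table: Impala also deletes the data files
DROP TABLE exported_data;     -- external table: the files under /data/exported_parquet stay in place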

Upvotes: 0
