Reputation: 1028
First off, apologies if this comes across as poorly worded; I've tried to help myself but I'm not clear on where it's going wrong.
I'm trying to query data in Impala which has been exported from another system.
Up till now it's been exported as a pipe-delimited text file, which I've been able to import fine by creating the table with the right delimiter set-up, copying in the file and then running a REFRESH statement.
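That set-up looked roughly like this (the table name, columns and paths here are placeholders for illustration, not the real ones):
-- in impala-shell
CREATE TABLE my_text_table (col1 STRING, col2 STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE;
# from a shell, copy the exported file into the table's HDFS directory
hdfs dfs -put export.txt /user/hive/warehouse/my_text_table/
-- back in impala-shell
REFRESH my_text_table;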
We've had some issues where some fields contain line-break characters, which makes it look like there are more rows than there really are, and the data no longer fits the table definition I've created.
The suggestion was made that we could use Parquet format instead and this would cope with the internal line-breaks fine.
I've received data and it looks a bit like this (I changed the username):
-rw-r--r--+ 1 UserName Domain Users 20M Jan 17 10:15 part-00000-6a763116-6728-4467-a641-32dd710857fe.snappy.parquet
-rw-r--r--+ 1 UserName Domain Users 156K Jan 17 10:15 .part-00000-6a763116-6728-4467-a641-32dd710857fe.snappy.parquet.crc
-rw-r--r--+ 1 UserName Domain Users 14M Jan 17 10:15 part-00001-6a763116-6728-4467-a641-32dd710857fe.snappy.parquet
-rw-r--r--+ 1 UserName Domain Users 110K Jan 17 10:15 .part-00001-6a763116-6728-4467-a641-32dd710857fe.snappy.parquet.crc
-rw-r--r--+ 1 UserName Domain Users 0 Jan 17 10:15 _SUCCESS
-rw-r--r--+ 1 UserName Domain Users 8 Jan 17 10:15 ._SUCCESS.crc
If I create a table stored as Parquet through Impala and then do an hdfs dfs -ls on it, I get something like the following:
-rwxrwx--x+ 3 hive hive 2103 2019-01-23 10:00 /filepath/testtable/594eb1cd032d99ad-5c13d29e00000000_1799839777_data.0.parq
drwxrwx--x+ - hive hive 0 2019-01-23 10:00 /filepath/testtable/_impala_insert_staging
Which is obviously a bit different to what I've received...
How do I create the table in Impala so that it can accept what I've received? And do I just need the .parquet files in there, or do I also need to put the .parquet.crc files in?
Or is what I've received not fit for purpose?
I've tried looking at the Impala documentation for this, but I don't think it covers this bit.
Is it something I need to do with a SerDe?
I tried specifying the compression_codec as Snappy, but this gave the same results.
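For reference, that attempt was along the lines of the following in impala-shell before creating the table:
SET COMPRESSION_CODEC=snappy;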
Any help would be appreciated.
Upvotes: 0
Views: 2874
Reputation: 3105
The names of the files do not matter; as long as they are not special files (like _SUCCESS or .something.crc), they will be read by Impala as Parquet files. You don't need the .crc or _SUCCESS files.
You can use Parquet files from an external source in Impala in two ways:
First create a Parquet table in Impala, then put the external files into the directory that corresponds to the table.
Create a directory, put the external files into it and then create a so-called external table in Impala. (You can put more data files there later as well.)
After putting external files into the table's directory, you have to issue INVALIDATE METADATA table_name; to make Impala check for new files.
The syntax for creating a regular Parquet table is
CREATE TABLE table_name (col_name data_type, ...)
STORED AS PARQUET;
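For the first approach, a minimal end-to-end sketch might look like this (the table name, columns and warehouse path are placeholders; the actual directory depends on your configuration):
-- in impala-shell
CREATE TABLE my_parquet_table (id BIGINT, name STRING)
STORED AS PARQUET;
# from a shell, copy the received Parquet files into the table's directory
hdfs dfs -put part-00000-*.snappy.parquet part-00001-*.snappy.parquet /user/hive/warehouse/my_parquet_table/
-- back in impala-shell
INVALIDATE METADATA my_parquet_table;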
The syntax for creating an external Parquet table is
CREATE EXTERNAL TABLE table_name (col_name data_type, ...)
STORED AS PARQUET LOCATION '/path/to/directory';
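For the second approach, a sketch assuming the files are uploaded to /path/to/directory (again, names and paths are placeholders to adjust for your environment):
# from a shell
hdfs dfs -mkdir -p /path/to/directory
hdfs dfs -put part-00000-*.snappy.parquet part-00001-*.snappy.parquet /path/to/directory/
-- in impala-shell
CREATE EXTERNAL TABLE my_parquet_table (id BIGINT, name STRING)
STORED AS PARQUET LOCATION '/path/to/directory';
-- run this again whenever more files are added to the directory
INVALIDATE METADATA my_parquet_table;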
An excerpt from the Overview of Impala Tables section of the docs:
Physically, each table that uses HDFS storage is associated with a directory in HDFS. The table data consists of all the data files underneath that directory:
- Internal tables are managed by Impala, and use directories inside the designated Impala work area.
- External tables use arbitrary HDFS directories, where the data files are typically shared between different Hadoop components.
An excerpt from the CREATE TABLE Statement section of the docs:
By default, Impala creates an "internal" table, where Impala manages the underlying data files for the table, and physically deletes the data files when you drop the table. If you specify the EXTERNAL clause, Impala treats the table as an "external" table, where the data files are typically produced outside Impala and queried from their original locations in HDFS, and Impala leaves the data files in place when you drop the table. For details about internal and external tables, see Overview of Impala Tables.
Upvotes: 0