Reputation: 235
I have some data in HDFS that I would like to create an external table and query via Impala. The data is tab separated but also contains the field name. Example data:
state:IL city:chicago population:2714856
state:NY city:New York population:8336697
I know how to create a table and specify the data is tab delimited, but is there a way to handle the fields in the data?
Upvotes: 2
Views: 7821
Reputation: 2333
The solution in Impala uses the same REGEXP_EXTRACT logic as the Pig example that I posted earier.
--csp.txt (input file, residing in /user/cloudera/csp)
state:New York city:New York population:8336697
state:California city:Los Angeles population:3857799
state:Illinois city:Chicago population:2714856
state:Texas city:Houston population:2160821
state:Pennsylvania city:Philadelphia population:1547607
state:Arizona city:Phoenix population:1488750
state:Texas city:San Antonio population:1382951
state:California city:San Diego population:1338348
state:Texas city:Dallas population:1241162
state:California city:San Jose population:982765
state:Texas city:Austin population:842592
Create Database and External Table
CREATE DATABASE IF NOT EXISTS CSP COMMENT 'City, State, Population';
DROP TABLE IF EXISTS CSP.original;
CREATE EXTERNAL TABLE IF NOT EXISTS CSP.original
(
st STRING COMMENT 'State',
ct STRING COMMENT 'City',
po STRING COMMENT 'Population'
)
COMMENT 'Original Table'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/user/cloudera/csp';
Select statement to regexp out the "state:", "city:" and "population:" text
SELECT
regexp_extract(st, '.*:(\\w.*)', 1) AS state,
regexp_extract(ct, '.*:(\\w.*)', 1) AS city,
regexp_extract(po, '.*:(\\w.*)', 1) AS population
FROM original;
Query Results
[localhost.localdomain:21000] > select regexp_extract(st, '.*:(\\w.*)', 1) AS state, regexp_extract(ct, '.*:(\\w.*)', 1) AS city, regexp_extract(po, '.*:(\\w.*)', 1) AS population FROM original limit 11;
Query: select regexp_extract(st, '.*:(\\w.*)', 1) AS state, regexp_extract(ct, '.*:(\\w.*)', 1) AS city, regexp_extract(po, '.*:(\\w.*)', 1) AS population FROM original limit 11
+--------------+--------------+------------+
| state | city | population |
+--------------+--------------+------------+
| New York | New York | 8336697 |
| California | Los Angeles | 3857799 |
| Illinois | Chicago | 2714856 |
| Texas | Houston | 2160821 |
| Pennsylvania | Philadelphia | 1547607 |
| Arizona | Phoenix | 1488750 |
| Texas | San Antonio | 1382951 |
| California | San Diego | 1338348 |
| Texas | Dallas | 1241162 |
| California | San Jose | 982765 |
| Texas | Austin | 842592 |
+--------------+--------------+------------+
Returned 11 row(s) in 0.22s
The easiest way for me to conceptualize the process was actually in Pig first, so I mocked up a data file using your syntax, and created the program in Pig. The output of the program is a csv formatted file, which can be used to create the Impala external table, if you like.
--csp.pig
REGISTER piggybank.jar
A = LOAD 'csp.txt' USING PigStorage('\t') AS (st:chararray,ct:chararray,po:chararray);
data = FOREACH A GENERATE
REGEX_EXTRACT(st, '.*:(\\w.*)', 1) AS (state:chararray),
REGEX_EXTRACT(ct, '.*:(\\w.*)', 1) AS (city:chararray),
REGEX_EXTRACT(po, '.*:(\\w.*)', 1) AS (population:int);
STORE data INTO 'csp' USING org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'NO_MULTILINE');
--csp.txt (input)
state:New York city:New York population:8336697
state:California city:Los Angeles population:3857799
state:Illinois city:Chicago population:2714856
state:Texas city:Houston population:2160821
state:Pennsylvania city:Philadelphia population:1547607
state:Arizona city:Phoenix population:1488750
state:Texas city:San Antonio population:1382951
state:California city:San Diego population:1338348
state:Texas city:Dallas population:1241162
state:California city:San Jose population:982765
state:Texas city:Austin population:842592
--csp (output)
New York,New York,8336697
California,Los Angeles,3857799
Illinois,Chicago,2714856
Texas,Houston,2160821
Pennsylvania,Philadelphia,1547607
Arizona,Phoenix,1488750
Texas,San Antonio,1382951
California,San Diego,1338348
Texas,Dallas,1241162
California,San Jose,982765
Texas,Austin,842592
Upvotes: 2