Reputation: 235
I would like to create an external HIVE table that utilizes data already in HDFS. The files are in a directory like /hdfs/data/location
and sub directories with the year-month
format. ex: /hdfs/data/location/2013-december
and /hdfs/data/location/2014-january
.
Within these directories are multiple files but within the files are different types of data (different fields). Examples of the different types of records are:
Type A
type
state
city
population
Type B
type
zipcode
registeredvoters
Actual data example (tab delimited)
type:A state:New York city:New York population:8336697
type:A state:California city:Los Angeles population:3857799
type:B zipcode:92118 registeredvoters:794051
type:B zipcode:92155 registeredvoters:794053
type:A state:Illinois city:Chicago population:2714856
The data is already in this format and being used by other processes outside of HIVE, so changing it may not be an option. I would also not want to duplicate the data in HDFS.
Is there a way to create a HIVE table for only a given type as defined in the data above?
Here is what I have so far for the create:
create external table population (
type string,
state string,
city string,
population int
)
location '/hdfs/data/location';
Upvotes: 2
Views: 841
Reputation: 1630
I don't think you can have a table, but I think you could make a view, interpreting the line using the str_to_map UDF
create external table raw_population( line string ) location '/hdfs/data/location';
create view population_view as
select
pmap['type'] as type,
pmap['state'] as state,
pmap['city'] as city
pmap['population'] as population
from
( select str_to_map( line, '\t', ':') as pmap from raw_population ) pm;
Upvotes: 4