Reputation: 89
I am pretty new with HiveQL and I am kinda stuck :S
I have a table of the following schema. One column named res and three partitioned under partion_column named filed.
create table results( res string) PARTITIONED BY (field STRING);
I then imported data in this table
insert overwrite table results PARTITION (field= 'title') SELECT explode(line) AS myNewCol FROM titles ;
insert overwrite table results PARTITION (field= 'artist') SELECT explode(line) AS myNewCol FROM artist;
insert overwrite table results PARTITION (field= 'albums') SELECT explode(line) AS myNewCol FROM albums;
I am trying to count the unique tubles in the three partitions.
For example this command count the number of existence of certain titles in the dataset.
SELECT res, count(1) AS counttotal FROM results where field='title' GROUP BY res ORDER BY counttotal;
and it outputs something like
title count
Hit me Baby More time 9
How can I extend this to tuples ( title, album, artist)? If I want to have an output like :
title album artist count
Baby one more time hit me baby one more time britney spears 9
My whole code:
CREATE EXTERNAL TABLE IF NOT EXISTS hivetesttable (
xmldata STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
location '/user/sdasd/hivetestdata/';
create view xmlout(line) as select * from hivetesttable;
CREATE VIEW TITLES(line) as select xpath(line,'/MC/SC/*/@ttl') from xmlout;
CREATE VIEW ARTIST(line) as select xpath(line,'/MC/SC/*/@art') from xmlout;
CREATE VIEW ALBUMS( line) as select xpath(line,'/MC/SC/*/@art') from xmlout;
create table results( res string) PARTITIONED BY (field STRING);
insert overwrite table results PARTITION (field= 'title') SELECT explode(line) AS myNewCol FROM titles ;
insert overwrite table results PARTITION (field= 'artist') SELECT explode(line) AS myNewCol FROM artist;
insert overwrite table results PARTITION (field= 'albums') SELECT explode(line) AS myNewCol FROM albums;
SELECT res, count(1) AS counttotal FROM results where field='title' GROUP BY res ORDER BY counttotal;
a row of the xml data is like
<?xml version="1.0" encoding="UTF-8"?><MC><SC><S uid="2" gen="" yr="2011" art="Samsung" cmp="<unknown>" fld="/mnt/sdcard/Samsung/Music" alb="Samsung" ttl="Over the horizon"/><S uid="37" gen="" yr="2010" art="Jason Derulo" cmp="<unknown>" fld="/mnt/sdcard/Music/Jason Derulo/Jason Derulo" alb="Jason Derulo" ttl="Whatcha Say"/><S uid="38" gen="" yr="2010" art="Jason Derulo" cmp="<unknown>" fld="/mnt/sdcard/Music/Jason Derulo/Jason Derulo" alb="Jason Derulo" ttl="In My Head"/><S uid="39" gen="" yr="2011" art="Alexandra Stan" cmp="<unknown>" fld="/mnt/sdcard/Music/Alexandra Stan/Mr_ Saxobeat - Single" alb="Mr. Saxobeat - Single" ttl="Mr. Saxobeat (Extended Version)"/><S uid="40" gen="" yr="2011" art="Bushido" cmp="<unknown>" fld="/mnt/sdcard/Music/Bushido/Jenseits von Gut und Böse (Premium Edition)" alb="Jenseits von Gut und Böse (Premium Edition)" ttl="Wie ein Löwe"/><S uid="41" gen="" yr="2011" art="Bushido" cmp="<unknown>" fld="/mnt/sdcard/Music/Bushido/Jenseits von Gut und Böse (Premium Edition)" alb="Jenseits von Gut und Böse (Premium Edition)" ttl="Verreckt"/><S uid="42" gen="" yr="2011" art="Lucenzo" cmp="<unknown>" fld="/mnt/sdcard/Music/Lucenzo/Danza Kuduro (feat_ Don Omar) [From _Fast & Furious 5_] - Single" alb="Danza Kuduro (feat. Don Omar) [From "Fast & Furious 5"] - Single" ttl="Danza Kuduro (feat. Don Omar) [From "Fast & Furious 5"]"/><S uid="121" gen="" yr="701" art="Michael Jackson" cmp="<unknown>" fld="/mnt/sdcard/external_sd/Music/Michael Jackson/Bad [Bonus Tracks]" alb="Bad [Bonus Tracks]" ttl="Voice-Over Intro/Quincy Jones Interview #1 [*]"/></SC><PC/></MC>
Upvotes: 1
Views: 3262
Reputation: 2573
Based on the information you've provided, the output you want is not possible. Right now you have a table that looks like this:
res field
--- -----
baby one more time title
baby one more time title
baby one more time title
baby one more time title
baby one more time title
baby one more time title
baby one more time title
baby one more time title
baby one more time title
hit me baby one more time album
hit me baby one more time album
hit me baby one more time album
hit me baby one more time album
hit me baby one more time album
hit me baby one more time album
hit me baby one more time album
hit me baby one more time album
hit me baby one more time album
britney spears artist
britney spears artist
britney spears artist
britney spears artist
britney spears artist
britney spears artist
britney spears artist
britney spears artist
britney spears artist
the distance title
the distance title
open book title
daria title
fashion nugget album
fashion nugget album
fashion nugget album
fashion nugget album
cake artist
cake artist
cake artist
cake artist
Because you partitioned it, Hive happens to store it in three different folders but this doesn't affect the results of the query. I added some extra tracks and I imagine with the extra tracks you would want the output to be (correct me if I'm wrong):
title album artist count
baby one more time hit me baby one mroe time britney spears 9
the distance fashion nuggets cake 2
open book fashion nuggets cake 1
daria fashion nuggets cake 1
But there is no way to tell that "open book" has anything to do with "fashion nuggets" or "cake", just like there is no way to tell that "baby one more time" is associated with "britney spears". You could try to match on the counts but then you would end up with something like this
title album artist count
baby one more time hit me baby one more time britney spears 9
null fashion nuggets cake 3
the distance null null 1
open book,daria null null 1
I think you wanted a table with columns like this
title album artist
baby one more hit me baby one more time britney spears
baby one more hit me baby one more time britney spears
baby one more hit me baby one more time britney spears
baby one more hit me baby one more time britney spears
baby one more hit me baby one more time britney spears
baby one more hit me baby one more time britney spears
baby one more hit me baby one more time britney spears
baby one more hit me baby one more time britney spears
baby one more hit me baby one more time britney spears
the distance fashion nuggets cake
the distance fashion nuggets cake
open book fashion nuggets cake
daria fashion nuggets cake
but still partitioned on maybe artist and/or album. With or without the partitioning, you can write the query as if the table is not partitioned (it doesn't effect results as long as the data isn't corrupted, only performance). It will affect how you create and populate the table, however. Let me know if this is what you wanted an I'll edit this answer to answer that question instead.
THE EDIT AS PROMISED:
Okay, creating the table without any partitions is straightforward:
CREATE TABLE results (title string, album string, artist string)
Creating the table with with partitions is almost as straightforward, you just need to first decided what to partition on. If you partition on artist it will mean you can run queries particular to a single or set of artists without having to process information for other artists. If you partition by artist and album you can do the same thing with albums as well. This does come at the cost of breaking a large file into smaller files, and generally MapReduce (and therefore Hive) works better with large files. I wouldn't worry about partitioning at all unless you dealing with at least 10's of GBs and feel like you have a handle on how paritioning works and HiveQL in general. But for completeness, partitioning by artist:
CREATE TABLE results (title string, album string) PARTITIONED BY (artist string);
and partitioned by artist then by album. Partitioning by (artist string, album string)
vs (album string, artist string)
won't change your results, but you should put the logical top of the hierarchy first.
CREATE TABLE (title string) PARTITIONED BY (artist string, album string);
Populating this table won't be easy if the only information we have access to are from the tables titles, artists, and albums
since we have a huge list of titles, artists, and albums but no way to tell which title goes with which album for example. I hope you have some data where these relationships are still intact or your data set is still intact. Without knowing the form of this hypothetical data, I can't provide an answer for how to populate your tables. But if you have partitioned tables, this answer might be useful to you if you don't want to manually specify every artist and album(since every artist gets there own partition, and within though partition every album gets it's own partition).
EDIT: The asker has xml files which have the title, ablum, arist relationships intact. More information on this in the comments.
Now the meat of the question is counting unique tuples. This will be the same regardless of how data was partitioned, if at all. We do this using the GROUP BY
clause. When you specify one column (or partition, which can be thought of as a column with special properties), you break the data down into groups which have distinct values for that column. If you specify several columns, you break the data down into groups with have distinct values for their combination of columns. This is the we take advantage of to count distinct tuples:
SELECT title, album, artist, COUNT(*)
FROM results
GROUP BY title, album, artist
and here we are:
title album artist count
baby one more time hit me baby one mroe time britney spears 9
the distance fashion nuggets cake 2
open book fashion nuggets cake 1
daria fashion nuggets cake 1
Upvotes: 1