user1309258
user1309258

Reputation: 89

Hive count tuple?

I am pretty new with HiveQL and I am kinda stuck :S

I have a table of the following schema. One column named res and three partitioned under partion_column named filed.

create table results( res string) PARTITIONED BY (field STRING); 

I then imported data in this table

insert overwrite table results PARTITION (field= 'title') SELECT  explode(line) AS myNewCol FROM titles ;
insert overwrite table results PARTITION (field= 'artist') SELECT  explode(line) AS myNewCol FROM artist;
insert overwrite table results PARTITION (field= 'albums') SELECT  explode(line) AS myNewCol FROM albums;

I am trying to count the unique tubles in the three partitions.

For example this command count the number of existence of certain titles in the dataset.

 SELECT res, count(1) AS counttotal   FROM results where field='title' GROUP BY res ORDER BY counttotal;

and it outputs something like

 title                                count        
 Hit me Baby More time                   9

How can I extend this to tuples ( title, album, artist)? If I want to have an output like :

title                            album                 artist       count

Baby one more time    hit me baby one more time    britney spears    9

My whole code:

CREATE EXTERNAL TABLE IF NOT EXISTS hivetesttable  (
xmldata STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
location '/user/sdasd/hivetestdata/';

create view xmlout(line) as  select * from hivetesttable;  

CREATE VIEW TITLES(line) as select xpath(line,'/MC/SC/*/@ttl')  from xmlout;
CREATE VIEW ARTIST(line) as select  xpath(line,'/MC/SC/*/@art')  from xmlout;
CREATE VIEW ALBUMS( line) as select   xpath(line,'/MC/SC/*/@art') from xmlout;



create table results( res string) PARTITIONED BY (field STRING); 
insert overwrite table results PARTITION (field= 'title') SELECT  explode(line) AS myNewCol FROM titles ;
insert overwrite table results PARTITION (field= 'artist') SELECT  explode(line) AS myNewCol FROM artist;
insert overwrite table results PARTITION (field= 'albums') SELECT  explode(line) AS myNewCol FROM albums;

SELECT res, count(1) AS counttotal   FROM results where field='title' GROUP BY res ORDER BY counttotal;

a row of the xml data is like

<?xml version="1.0" encoding="UTF-8"?><MC><SC><S uid="2" gen="" yr="2011" art="Samsung" cmp="&lt;unknown&gt;" fld="/mnt/sdcard/Samsung/Music" alb="Samsung" ttl="Over the horizon"/><S uid="37" gen="" yr="2010" art="Jason Derulo" cmp="&lt;unknown&gt;" fld="/mnt/sdcard/Music/Jason Derulo/Jason Derulo" alb="Jason Derulo" ttl="Whatcha Say"/><S uid="38" gen="" yr="2010" art="Jason Derulo" cmp="&lt;unknown&gt;" fld="/mnt/sdcard/Music/Jason Derulo/Jason Derulo" alb="Jason Derulo" ttl="In My Head"/><S uid="39" gen="" yr="2011" art="Alexandra Stan" cmp="&lt;unknown&gt;" fld="/mnt/sdcard/Music/Alexandra Stan/Mr_ Saxobeat - Single" alb="Mr. Saxobeat - Single" ttl="Mr. Saxobeat (Extended Version)"/><S uid="40" gen="" yr="2011" art="Bushido" cmp="&lt;unknown&gt;" fld="/mnt/sdcard/Music/Bushido/Jenseits von Gut und Böse (Premium Edition)" alb="Jenseits von Gut und Böse (Premium Edition)" ttl="Wie ein Löwe"/><S uid="41" gen="" yr="2011" art="Bushido" cmp="&lt;unknown&gt;" fld="/mnt/sdcard/Music/Bushido/Jenseits von Gut und Böse (Premium Edition)" alb="Jenseits von Gut und Böse (Premium Edition)" ttl="Verreckt"/><S uid="42" gen="" yr="2011" art="Lucenzo" cmp="&lt;unknown&gt;" fld="/mnt/sdcard/Music/Lucenzo/Danza Kuduro (feat_ Don Omar) [From _Fast &amp; Furious 5_] - Single" alb="Danza Kuduro (feat. Don Omar) [From &quot;Fast &amp; Furious 5&quot;] - Single" ttl="Danza Kuduro (feat. Don Omar) [From &quot;Fast &amp; Furious 5&quot;]"/><S uid="121" gen="" yr="701" art="Michael Jackson" cmp="&lt;unknown&gt;" fld="/mnt/sdcard/external_sd/Music/Michael Jackson/Bad [Bonus Tracks]" alb="Bad [Bonus Tracks]" ttl="Voice-Over Intro/Quincy Jones Interview #1 [*]"/></SC><PC/></MC>

Upvotes: 1

Views: 3262

Answers (1)

Daniel Koverman
Daniel Koverman

Reputation: 2573

Based on the information you've provided, the output you want is not possible. Right now you have a table that looks like this:

res                           field
---                           -----
baby one more time            title
baby one more time            title
baby one more time            title
baby one more time            title
baby one more time            title
baby one more time            title
baby one more time            title
baby one more time            title
baby one more time            title
hit me baby one more time     album
hit me baby one more time     album
hit me baby one more time     album
hit me baby one more time     album
hit me baby one more time     album
hit me baby one more time     album
hit me baby one more time     album
hit me baby one more time     album
hit me baby one more time     album
britney spears                artist
britney spears                artist
britney spears                artist
britney spears                artist
britney spears                artist
britney spears                artist
britney spears                artist
britney spears                artist
britney spears                artist
the distance                  title
the distance                  title
open book                     title
daria                         title
fashion nugget                album
fashion nugget                album
fashion nugget                album
fashion nugget                album
cake                          artist
cake                          artist
cake                          artist
cake                          artist

Because you partitioned it, Hive happens to store it in three different folders but this doesn't affect the results of the query. I added some extra tracks and I imagine with the extra tracks you would want the output to be (correct me if I'm wrong):

title                  album                       artist              count
baby one more time     hit me baby one mroe time   britney spears      9
the distance           fashion nuggets             cake                2
open book              fashion nuggets             cake                1
daria                  fashion nuggets             cake                1

But there is no way to tell that "open book" has anything to do with "fashion nuggets" or "cake", just like there is no way to tell that "baby one more time" is associated with "britney spears". You could try to match on the counts but then you would end up with something like this

title                  album                       artist              count
baby one more time     hit me baby one more time   britney spears      9
null                   fashion nuggets             cake                3
the distance           null                        null                1
open book,daria        null                        null                1

I think you wanted a table with columns like this

title                  album                         artist
baby one more          hit me baby one more time     britney spears
baby one more          hit me baby one more time     britney spears
baby one more          hit me baby one more time     britney spears
baby one more          hit me baby one more time     britney spears
baby one more          hit me baby one more time     britney spears
baby one more          hit me baby one more time     britney spears
baby one more          hit me baby one more time     britney spears
baby one more          hit me baby one more time     britney spears
baby one more          hit me baby one more time     britney spears
the distance           fashion nuggets               cake
the distance           fashion nuggets               cake
open book              fashion nuggets               cake
daria                  fashion nuggets               cake

but still partitioned on maybe artist and/or album. With or without the partitioning, you can write the query as if the table is not partitioned (it doesn't effect results as long as the data isn't corrupted, only performance). It will affect how you create and populate the table, however. Let me know if this is what you wanted an I'll edit this answer to answer that question instead.


THE EDIT AS PROMISED:

Okay, creating the table without any partitions is straightforward:

CREATE TABLE results (title string, album string, artist string)

Creating the table with with partitions is almost as straightforward, you just need to first decided what to partition on. If you partition on artist it will mean you can run queries particular to a single or set of artists without having to process information for other artists. If you partition by artist and album you can do the same thing with albums as well. This does come at the cost of breaking a large file into smaller files, and generally MapReduce (and therefore Hive) works better with large files. I wouldn't worry about partitioning at all unless you dealing with at least 10's of GBs and feel like you have a handle on how paritioning works and HiveQL in general. But for completeness, partitioning by artist:

CREATE TABLE results (title string, album string) PARTITIONED BY (artist string);

and partitioned by artist then by album. Partitioning by (artist string, album string) vs (album string, artist string) won't change your results, but you should put the logical top of the hierarchy first.

CREATE TABLE (title string) PARTITIONED BY (artist string, album string);

Populating this table won't be easy if the only information we have access to are from the tables titles, artists, and albums since we have a huge list of titles, artists, and albums but no way to tell which title goes with which album for example. I hope you have some data where these relationships are still intact or your data set is still intact. Without knowing the form of this hypothetical data, I can't provide an answer for how to populate your tables. But if you have partitioned tables, this answer might be useful to you if you don't want to manually specify every artist and album(since every artist gets there own partition, and within though partition every album gets it's own partition).

EDIT: The asker has xml files which have the title, ablum, arist relationships intact. More information on this in the comments.

Now the meat of the question is counting unique tuples. This will be the same regardless of how data was partitioned, if at all. We do this using the GROUP BY clause. When you specify one column (or partition, which can be thought of as a column with special properties), you break the data down into groups which have distinct values for that column. If you specify several columns, you break the data down into groups with have distinct values for their combination of columns. This is the we take advantage of to count distinct tuples:

SELECT title, album, artist, COUNT(*)
FROM results
GROUP BY title, album, artist

and here we are:

title                  album                       artist              count
baby one more time     hit me baby one mroe time   britney spears      9
the distance           fashion nuggets             cake                2
open book              fashion nuggets             cake                1
daria                  fashion nuggets             cake                1

Upvotes: 1

Related Questions