mowen10
mowen10

Reputation: 393

How to combine split and count in hive

I am trying to write a query to split a column after every @ character. I then want to be able to do a count of these within each segment.

I have managed to write the following query in Hive:

SELECT 
distinct split (msg_txt,'\\@')[0] AS first_msg, count(*)
FROM table1
;

but this won't allow me to add a group by in order to get a count. I tried doing this with a subquery:

SELECT first_msg, count(*)
FROM (
SELECT 
distinct split (msg_txt,'\\@')[0] AS first_msg
FROM table1
)
GROUP BY first_msg
;

but this gives me the following error:

Error while compiling statement: FAILED: ParseException line 7:6 missing EOF at 'BY' near 'GROUP'

so not sure how I can write this query.

If somebody could please advise would really appreciate it.

Thanks in advance.

Upvotes: 0

Views: 2110

Answers (2)

hlagos
hlagos

Reputation: 7947

base on your requirement, I am not sure why you are getting the first element. The query to ignore the first element of the split (considering you want to apply the group for all the elements after "@) should look like this

select value, count(*) from (
select 
pos,value
from table1 lateral view posexplode(split (msg_txt,'\\@')) explodedcol as pos,value limit 10
) t where pos != 0 group by value
;

if you want to include all the elements splitting by "@", just remove the "post != 0" condition from the where clause.

Regards,

Upvotes: 1

Gordon Linoff
Gordon Linoff

Reputation: 1269623

I think you just need a table alias:

SELECT first_msg, count(*)
FROM (SELECT distinct split(msg_txt,'\\@')[0] AS first_msg
      FROM table1
     ) t
GROUP BY first_msg;

Hive requires a table aliases:

The subquery has to be given a name because every table in a FROM clause must have a name.

In your version, it treats GROUP as the name of the subquery. The BY then doesn't make sense.

As written, this is a bit non-sensical because you can just do:

SELECT distinct split(msg_txt,'\\@')[0] AS first_msg, 1 as cnt
FROM table1;

The distinct in the subquery will ensure that all values are unique. I assume your actual problem is a little more complicated.

Upvotes: 1

Related Questions