suhabari
suhabari

Reputation: 135

Hive: Split string using regexp as a separate column

I have a string in a text column. I want to extract the hashtag values from the string into a new table so that I can find the distinct count for each hashtag.

Example strings->

NeverTrump is never more. They were crushed last night in Cleveland at Rules Committee by a vote of 87-12. MAKE AMERICA GREAT AGAIN!

CrookedHillary is outspending me by a combined 31 to 1 in Florida, Ohio, & Pennsylvania. I haven't started yet!

CrookedHillary is not qualified!

MakeAmericaSafeAgain!#GOPConvention #RNCinCLE

MakeAmericaGreatAgain #ImWithYou

Upvotes: 2

Views: 549

Answers (2)

leftjoin
leftjoin

Reputation: 38290

This is what @lazilyInitialised said, I did a query with your data example:

with your_data as (--This is your data example, use your table instead of this CTE
select stack( 1, 
1, --ID
"   NeverTrump is never more. They were crushed last night in Cleveland at Rules Committee by a vote of 87-12. MAKE AMERICA GREAT AGAIN!

    CrookedHillary is outspending me by a combined 31 to 1 in Florida, Ohio, & Pennsylvania. I haven't started yet!

    CrookedHillary is not qualified!

    MakeAmericaSafeAgain!#GOPConvention #RNCinCLE

    MakeAmericaGreatAgain #ImWithYou
"
) as (id, str)
)

select id, word as hashtag
from
(
select id, word
  from your_data d
       lateral view outer explode(split(regexp_replace(d.str, '#',' #' ),'\\s')) l as word --replace hash w space+hash, split and explode words
)s 
where word rlike '^#'      
;

Result:

OK
id      hashtag
1       #GOPConvention
1       #RNCinCLE
1       #ImWithYou
Time taken: 0.405 seconds, Fetched: 3 row(s)

Upvotes: 2

lazilyInitialised
lazilyInitialised

Reputation: 108

I am outlining the steps here as I'm not that good with the query, may update the answer once I get it right

  1. Replace '#' in string by ' #'.
  2. split each word in a string with space as delimiter.
  3. use explode() lateral view functionality to get all the words of the string.
  4. use a WHERE condition to fetch records starting with "#". LIKE '#%' condition should work.
  5. then add the group by condition to get the counts of each hashtag.

Upvotes: 2

Related Questions