Reputation: 401
I have a set of tweets with many different fields:
raw_tweets = LOAD 'input.tsv' USING PigStorage('\t') AS (tweet_id, text,
in_reply_to_status_id, favorite_count, source, coordinates, entities,
in_reply_to_screen_name, in_reply_to_user_id, retweet_count, is_retweet,
retweet_of_id, user_id_id, lang, created_at, event_id_id, is_news);
I want to find the most common words for each date. I managed to group the texts by date:
r1 = FOREACH raw_tweets GENERATE SUBSTRING(created_at,0,10) AS a, REPLACE
(LOWER(text),'([^a-z\\s]+)','') AS b;
r2 = group r1 by a;
r3 = foreach r2 generate group as a, r1 as b;
r4 = foreach r3 generate a, FLATTEN(BagToTuple(b.b));
Now it looks like:
(date text text3)
(date2 text2)
I removed the special characters, so only "real" words appear in the text. Sample:
2017-06-18 the plants are green the dog is black there are words this is
2017-06-19 more words and even more words another phrase begins here
I want the output to look like
2017-06-18 the are is
2017-06-19 more words and
I don't really care how many times each word appears. I just want to show the most common words; if two words appear the same number of times, show either of them.
Upvotes: 3
Views: 461
Reputation: 192043
While I'm sure there is a way to do this entirely in Pig, it would probably be more difficult than necessary.
In my opinion, a UDF is the way to go. I'll show Python here because it is quick to register in Pig, but it is only one option.
For example,
input.tsv
2017-06-18 the plants are green the dog is black there are words this is
2017-06-19 more words and even more words another phrase begins here
py_udfs.py
from collections import Counter
from operator import itemgetter

# outputSchema is made available by Pig's Jython engine when this file is registered
@outputSchema("y:bag{t:tuple(word:chararray,count:int)}")
def word_count(sentence):
    ''' Does a word count of a sentence and orders common words first '''
    words = Counter()
    for w in sentence.split():
        words[w] += 1
    # (word, count) pairs, highest count first
    values = ((word, count) for word, count in words.items())
    return sorted(values, key=itemgetter(1), reverse=True)
script.pig
REGISTER 'py_udfs.py' USING jython AS py_udfs;
A = LOAD 'input.tsv' USING PigStorage('\t') as (created_at:chararray,sentence:chararray);
B = FOREACH A GENERATE created_at, py_udfs.word_count(sentence);
DUMP B;
Output
(2017-06-18,{(is,2),(the,2),(are,2),(green,1),(black,1),(words,1),(this,1),(plants,1),(there,1),(dog,1)})
(2017-06-19,{(more,2),(words,2),(here,1),(another,1),(begins,1),(phrase,1),(even,1),(and,1)})
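The output above lists every word with its count, while the question only asks for the few most common words per date. A minimal sketch of that last step, written as plain Python so it can be tried outside Pig: the helper name `top_words` and the default cutoff of three are assumptions for illustration, not part of the UDF above. To use it from Pig it would still need an `@outputSchema` decorator (for example one declaring a single chararray) and a `REGISTER` line like the one in script.pig.

```python
from collections import Counter


def top_words(sentence, n=3):
    """Return the n most frequent words of a sentence as one space-separated string.

    Hypothetical helper: ties are broken in whatever order Counter.most_common
    yields, which is acceptable here because any of the tied words may be shown.
    """
    if not sentence:
        # Empty or missing text produces an empty result rather than an error
        return ""
    counts = Counter(sentence.split())
    return " ".join(word for word, _ in counts.most_common(n))


if __name__ == "__main__":
    # Sample rows taken from input.tsv above
    rows = [
        ("2017-06-18", "the plants are green the dog is black there are words this is"),
        ("2017-06-19", "more words and even more words another phrase begins here"),
    ]
    for created_at, sentence in rows:
        print(created_at, top_words(sentence))
```

Which word wins a tie depends on the interpreter: CPython 3.7 and later keep first-seen order for equal counts, while Jython may order them differently.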
If you are doing textual analysis, though, I would suggest
Upvotes: 1