Hardik Gupta

Reputation: 4790

Filter By Specific words in spark dataframe

I have a spark dataframe which has following data

    +---------------------------------------------------------------------------------------------------------------------------------------------------+
    |text                                                                                                                                               |
    +---------------------------------------------------------------------------------------------------------------------------------------------------+
    |Know what you don't do at 1:30 when you can't sleep? Music shopping. Now I want to dance. #shutUpAndDANCE                                          |
    |Serasi ade haha @AdeRais "@SMTOWNGLOBAL: #SHINee ONEW(@skehehdanfdldi) and #AMBER(@llama_ajol) at KBS 'Music Bank'."        |
    |Happy Birhday Ps.Jeffrey Rachmat #JR50 #flipagram  ? Music: This I Believe (The Creed) - Hillsong…                          |
    +---------------------------------------------------------------------------------------------------------------------------------------------------+

The dataframe has a single column, 'text', whose values contain words prefixed with #, e.g. '#shutUpAndDANCE'.

I am trying to read each word and filter the rows so that I am left with a list of only the words that start with a hash.

Code:

#Keep only the rows whose text contains '#'
hashtagList = sqlContext.sql("SELECT text FROM tweetstable WHERE text LIKE '%#%'")
hashtagList.show(100, truncate=False)  # show() prints the frame itself and returns None

#Process rows to split each text into words
hashtagList = hashtagList.map(lambda p: p.text).map(lambda x: x.split(" ")).collect()
print hashtagList

The output is :

[[u'Know', u'what', u'you', u"don't", u'do', u'at', u'1:30', u'when', u'you', u"can't", u'sleep?', u'Music', u'shopping.', u'Now', u'I', u'want', u'to', u'dance.', u'#shutUpAndDANCE'], [...]]

Is there a way to filter out everything else during my map stage so that only the #words are kept?

hashtagList = hashtagList.map(lambda p: p.text).map(lambda x: x.split(" "))<ADD SOMETHING HERE TO FETCH ONLY #>.collect()
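The keep-only-hashtags step itself needs nothing Spark-specific. As a sketch (the helper name `extract_hashtags` is mine, not from the question), the per-line logic is:

```python
def extract_hashtags(line):
    """Return only the whitespace-separated tokens that begin with '#'."""
    return [word for word in line.split() if word.startswith("#")]

sample = "Now I want to dance. #shutUpAndDANCE"
print(extract_hashtags(sample))  # ['#shutUpAndDANCE']
```

Passed to `flatMap`, this would replace the second `map` call, e.g. `hashtagList.map(lambda p: p.text).flatMap(extract_hashtags).collect()`, and should yield one flat list of hashtags rather than a list of word lists.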

Upvotes: 1

Views: 3868

Answers (2)

abaghel

Reputation: 15317

Try this.

from __future__ import print_function  # must come before any other statement
from pyspark.sql import Row

text = "Know what you don't do at 1:30 when you can't sleep? Music shopping. Now I want to dance. #shutUpAndDANCE Serasi ade haha @AdeRais @SMTOWNGLOBAL: #SHINee ONEW(@skehehdanfdldi) and #AMBER(@llama_ajol) at KBS 'Music Bank'.Happy Birhday Ps.Jeffrey Rachmat #JR50 #flipagram? Music: This I Believe (The Creed) - Hillsong"
df = spark.createDataFrame([Row(text)])
words = df.rdd.flatMap(list) \
    .flatMap(lambda line: line.split()) \
    .filter(lambda word: word.startswith("#"))
# foreach runs on the executors; in local mode the output appears in the console
words.foreach(print)

Upvotes: 2

user6022341

Reputation:

Use:

>>> from pyspark.sql.functions import split, explode, col
>>>
>>> df.select(explode(split("text", "\\s+")).alias("word")) \
...     .where(col("word").startswith("#")) \
...     .show()

Upvotes: 1
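To see what the split/explode/startswith pipeline produces, the same semantics can be mimicked in plain Python (a sketch: `re.split(r"\s+", ...)` mirrors Spark's `split(text, "\\s+")`, and the flattening comprehension plays the role of `explode`):

```python
import re

rows = [
    "Know what you don't do at 1:30 when you can't sleep? Music shopping. Now I want to dance. #shutUpAndDANCE",
    "Happy Birhday Ps.Jeffrey Rachmat #JR50 #flipagram",
]
# explode(split(...)) -> one token per output row; where(startswith('#')) keeps hashtags
words = [w for row in rows for w in re.split(r"\s+", row) if w.startswith("#")]
print(words)  # ['#shutUpAndDANCE', '#JR50', '#flipagram']
```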
