Reputation: 29
I'm working on a small project to understand PySpark, and I'm trying to get PySpark to do the following actions on the words in a text file: it should "ignore" any changes in capitalization of the words (i.e., While vs. while), and it should "ignore" any additional characters that might be on the end of the words (i.e., orange vs. orange, vs. orange. vs. orange?) and count them all as the same word.
I am fairly certain some kind of lambda function or regular expression is required, but I don't know how to generalize it enough that I can drop in any sort of text file (like a book) and have it spit back the correct analysis.
Here's my code so far:
import sys
from pyspark import SparkContext, SparkConf

# Create the SparkContext before using sc
sc = SparkContext(conf=SparkConf().setAppName("wordCount"))
# Read the file and split each line on spaces
input = sc.textFile("/home/user/YOURFILEHERE.txt")
words = input.flatMap(lambda line: line.split(" "))
# Pair each word with 1 and sum the counts per word
wordCounts = words.map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)
wordCounts.collect()
The last thing I need to do is make a frequency analysis of the words (i.e., the word "While" shows up 80% of the time), but I am fairly certain I know how to do that and am currently adding it to what I have now; I'm just having so many issues with the capitalization and the special-character handling.
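Something like this is roughly what I have in mind for the frequency step (just a sketch building on wordCounts above; the variable names are mine):

# Sketch of the frequency step: divide each count by the total number of words
totalWords = wordCounts.map(lambda pair: pair[1]).sum()
wordFrequencies = wordCounts.map(lambda pair: (pair[0], pair[1] / float(totalWords)))
wordFrequencies.collect()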
Any help on this issue, even just guidance, would be great. Thank you guys!
Upvotes: 2
Views: 360
Reputation: 725
Just replace the input with your text file; the key is the word_munge function:
import re
import string

def word_munge(single_word):
    # Lower-case the word and strip any punctuation characters
    lower_case_word = single_word.lower()
    return re.sub(f"[{re.escape(string.punctuation)}]", "", lower_case_word)

# sc is your SparkContext (it already exists in the pyspark shell)
input_string = "While orange, while orange while orange."
input_rdd = sc.parallelize([input_string])
words = input_rdd.flatMap(lambda line: line.split(" "))

(words
 .map(word_munge)
 .map(lambda word: (word, 1))
 .reduceByKey(lambda a, b: a + b)
 ).take(2)
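For the example string this should come back as [('while', 3), ('orange', 3)] (in some order), since "While", "while", "orange," and "orange." all collapse to the same keys. To run it over a file instead of the hard-coded string, swap sc.parallelize for sc.textFile; a rough sketch (the path is just a placeholder):

# Same pipeline, reading from a text file instead of a hard-coded string
words = sc.textFile("/home/user/YOURFILEHERE.txt").flatMap(lambda line: line.split(" "))
counts = words.map(word_munge).map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
counts.collect()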
Upvotes: 1