Reputation: 29
I'm working on a small project to understand PySpark, and I'm trying to get PySpark to do the following actions on the words in a text file: it should "ignore" any changes in capitalization of the words (i.e., While vs. while), and it should "ignore" any additional characters that might be on the end of the words (i.e., orange vs. orange, vs. orange. vs. orange?) and count them all as the same word.
I am fairly certain some kind of lambda function or regular expression is required, but I don't know how to generalize it enough that I can drop in any sort of text file (like a book) and have it spit back the correct analysis.
Here's my code so far:
import sys
from pyspark import SparkContext, SparkConf

# Create the SparkContext before using sc
sc = SparkContext(conf=SparkConf().setAppName("wordCount"))
# Read the file and split each line on spaces
input = sc.textFile("/home/user/YOURFILEHERE.txt")
words = input.flatMap(lambda line: line.split(" "))
# Pair each word with 1 and sum the counts per word
wordCounts = words.map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)
wordCounts.collect()
The last thing I need to do is make a frequency analysis of the words (i.e., the word "While" shows up 80% of the time), but I am fairly certain I know how to do that and am currently adding it to what I have now; I'm just having so many issues with the capitalization and the special-character handling.
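Something like this is roughly what I have in mind for the frequency step (just a sketch building on wordCounts above; the variable names are mine):

# Sketch of the frequency step: divide each count by the total number of words
totalWords = wordCounts.map(lambda pair: pair[1]).sum()
wordFrequencies = wordCounts.map(lambda pair: (pair[0], pair[1] / float(totalWords)))
wordFrequencies.collect()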
Any help on this issue, even just guidance, would be great. Thank you guys!
Upvotes: 2
Views: 360
Reputation: 725
Just replace the input with your text file; the key is the word_munge function:
import re
import string

def word_munge(single_word):
    # Lower-case the word and strip any punctuation characters
    lower_case_word = single_word.lower()
    return re.sub(f"[{re.escape(string.punctuation)}]", "", lower_case_word)

# sc is your SparkContext (it already exists in the pyspark shell)
input_string = "While orange, while orange while orange."
input_rdd = sc.parallelize([input_string])
words = input_rdd.flatMap(lambda line: line.split(" "))

(words
 .map(word_munge)
 .map(lambda word: (word, 1))
 .reduceByKey(lambda a, b: a + b)
 ).take(2)
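For the example string this should come back as [('while', 3), ('orange', 3)] (in some order), since "While", "while", "orange," and "orange." all collapse to the same keys. To run it over a file instead of the hard-coded string, swap sc.parallelize for sc.textFile; a rough sketch (the path is just a placeholder):

# Same pipeline, reading from a text file instead of a hard-coded string
words = sc.textFile("/home/user/YOURFILEHERE.txt").flatMap(lambda line: line.split(" "))
counts = words.map(word_munge).map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
counts.collect()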
Upvotes: 1