Demonxus

Reputation: 64

PySpark operations on text: counting words, unique words, most common words

So I basically have a text, which is Moby Dick. I converted it into an RDD, and it looks like this:

['The Project Gutenberg EBook of Moby Dick; or The Whale, by Herman',
 'Melville',
 'This eBook is for the use of anyone anywhere at no cost and with almost',
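
For reference, this is roughly how I loaded it (sc is the SparkContext from the pyspark shell; mobydick.txt is just what I named the file):

# Read the plain-text file line by line into an RDD of strings
rawMD = sc.textFile("mobydick.txt")
rawMD.take(3)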

I have to count all the words, count the unique words, find the 10 most common words, and count how often the word "whale" appears in the whole text. For the task, I have to split each line into separate words and remove blank lines:

MD = rawMD.filter(lambda x: x != "")

For counting all the words:

MDcount = MD.map(lambda x: x.split(" ")).flatMap(lambda x: x).filter(lambda x: x != "")
MDcount.count()

And the result is 214376, which I think is not quite right, but anyway.

Then, for the unique words:

MDcount.distinct().count()

Result: 33282. And there is a problem: I don't know how to delete the "'s" from words like "Whale's".
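
What I had in mind was something like this (a rough sketch, I haven't verified it), first turning the word RDD into a one-column DataFrame and then stripping the trailing possessive:

from pyspark.sql import Row
from pyspark.sql.functions import col, regexp_replace

# Sketch: build a one-column DataFrame from the word RDD...
DFcount = MDcount.map(lambda w: Row(Word=w)).toDF()
# ...and strip a trailing "'s" (e.g. "Whale's" -> "Whale")
DFcount = DFcount.withColumn("Word", regexp_replace(col("Word"), "'s$", ""))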

I tried this one instead:

DFcount = DFcount.select("Word", regexp_replace(col("Word"), "[_\"\'():;,.!?\\-]", "").alias("Clear"))
DFcount = DFcount.drop("Word")
DFcount.distinct().count()

Result: 23187 words

But still, not good enough. Wikipedia says that Moby Dick has roughly 16,000 unique words.
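
One thing I suspect inflates the count is case: "The" and "the" are counted as two different words. Something like this might bring it down (untested; DFlower is just a new name for the lowercased frame):

from pyspark.sql.functions import lower

# Sketch: lowercase the cleaned words so "Whale" and "whale"
# collapse into a single distinct entry
DFlower = DFcount.withColumn("Clear", lower(col("Clear")))
DFlower.distinct().count()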

For the most common words:

from pyspark.sql.functions import desc

Word_count = DFcount.groupby('Clear').count()
Word_count.orderBy(desc('count')).show(10)

With the result:

+-----+-----+
|Clear|count|
+-----+-----+
|  the|13838|
|   of| 6654|
|  and| 6040|
|   to| 4582|
|    a| 4543|
|   in| 3950|
| that| 2857|
|  his| 2459|
|   it| 2060|
|    I| 1834|
+-----+-----+

And for the count of "whale":

RDDcount = DFcount.rdd.map(lambda x: x[0])
MDwh = RDDcount.filter(lambda x: "whale" in x)
print(MDwh.count())

Result: 1329. Wikipedia says it's 1,685.
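
I wonder if case is part of the problem here too, since "whale" in x misses "Whale" and "WHALE". Maybe something like this (untested):

# Sketch: lowercase each word before the substring test,
# so capitalized occurrences are matched as well
MDwh = RDDcount.filter(lambda x: "whale" in x.lower())
print(MDwh.count())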

I think something is wrong, because I keep seeing apostrophes, commas, etc. in the text. I suspect the problem is in how I split the lines and strip the unnecessary characters. Does anyone see the correct approach for these tasks?

Upvotes: 1

Views: 2345

Answers (1)

vladsiv

Reputation: 2936

I've downloaded the book from Project Gutenberg: Moby Dick; Or, The Whale by Herman Melville, in Plain Text UTF-8.

Delete the obvious additional text from the top and bottom (the Project Gutenberg header and footer) and save it to a file: mobydick.

There's a function spark.read.text which reads a text file and creates a new row for each line. The idea is to split the rows, explode them, and group them by words; after that, just perform the needed calculations.

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.read.text("mobydick")
df = df.filter(F.col("value") != "")  # Remove empty rows

word_counts = (
    df.withColumn("word", F.explode(F.split(F.col("value"), r"\s+")))
    .withColumn("word", F.regexp_replace("word", r"[^\w]", ""))
    .groupBy("word")
    .count()
    .sort("count", ascending=False)
)

# Top 10
word_counts.show(10)

# All words count
word_counts.agg(F.sum("count").alias("count_all_words")).show()

# Whale count
word_counts.filter(F.col("word").rlike("(?i)whale")).agg(
    F.sum("count").alias("whale_count")
).show()

# Unique count
print("Unique words: ", word_counts.count())

Result:

+----+-----+
|word|count|
+----+-----+
| the|13701|
|  of| 6551|
| and| 5992|
|  to| 4513|
|   a| 4491|
|  in| 3905|
|that| 2865|
| his| 2462|
|  it| 2089|
|   I| 1942|
+----+-----+

+---------------+
|count_all_words|
+---------------+
|         212469|
+---------------+

+-----------+
|whale_count|
+-----------+
|       1687|
+-----------+

Unique words:  21837

With more cleaning you can get closer to the exact results. I guess the unique-word count is a bit off because it needs more cleaning and maybe stemming.
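
For example, something along these lines (a rough sketch, not verified against the Wikipedia numbers) lowercases everything, drops possessives, and removes any leftover non-letter characters:

# Rough sketch: lowercase, strip a trailing possessive "'s",
# then remove any remaining non-letter characters before grouping
word_counts_cleaner = (
    df.withColumn("word", F.explode(F.split(F.col("value"), r"\s+")))
    .withColumn("word", F.lower(F.col("word")))
    .withColumn("word", F.regexp_replace("word", r"'s$", ""))
    .withColumn("word", F.regexp_replace("word", r"[^a-z]", ""))
    .filter(F.col("word") != "")
    .groupBy("word")
    .count()
)
print("Unique words after extra cleaning:", word_counts_cleaner.count())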

Upvotes: 2
