Jake Henningsgaard
Jake Henningsgaard

Reputation: 702

Filter stop words in Spark

I am attempting to filter out the stop words out of an RDD of words from a .txt file.

// Creating the RDDs
val input = sc.textFile("../book.txt")
val stopWordsInput = sc.textFile("../stopwords.csv")
val stopWords = stopWordsInput.map(x => x.split(","))

// Create a tuple of test words
val testWords = ("you", "to")

// Split using a regular expression that extracts words
val wordsWithStopWords = input.flatMap(x => x.split("\\W+"))

The code above all makes sense to me and seems to work fine. This is where I'm having trouble.

//Remove the stop words from the list
val words = wordsWithStopWords.filter(x => x != testWords)

This will run but doesn't actually filter out the words contained in the tuple testWords. I'm not sure how to test the words in wordsWithStopWords against every word in my tuple testWords

Upvotes: 2

Views: 13119

Answers (3)

game pirate
game pirate

Reputation: 23

Using subtractByKey:

// Creating the RDDs
val input = sc.textFile("../book.txt")
val stopWordsInput = sc.textFile("../stopwords.csv")

// Split using a regular expression that extracts words from input RDD
val wordsWithInput = input.flatMap(x => x.split("\\W+"))


//Converting above RDDs to lowercase
val lowercaseInput = wordsWithInput.map(x => x.toLowerCase())
val lowercaseStopWordsInput = stopWordsInput.map(x => x.toLowerCase())

//Creating a tuple(word, 1) using map for above RDDs
val tupleInput = lowercaseInput.map(x => (x,1))
val tupleStopWordsInput = lowercaseStopWordsInput.map(x => (x,1))

//using subtractByKey
val tupleWords = tupleInput.subtractByKey(tupleStopWordsInput)

//to have only words in RDD
val words = tupleWords.keys

Upvotes: 0

eliasah
eliasah

Reputation: 40380

You can use a Broadcast variable to filter with your stopwords RDD :

// Creating the RDDs
val input = sc.textFile("../book.txt")
val stopWordsInput = sc.textFile("../stopwords.csv")

// Flatten, collect, and broadcast.
val stopWords = stopWordsInput.flatMap(x => x.split(",")).map(_.trim)
val broadcastStopWords = sc.broadcast(stopWords.collect.toSet)

// Split using a regular expression that extracts words
val wordsWithStopWords: RDD[String] = input.flatMap(x => x.split("\\W+"))
wordsWithStopWords.filter(!broadcastStopWords.value.contains(_))

Broadcast variables allow you to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. They can be used, for example, to give every node a copy of a large input dataset in an efficient manner (in this case as well).

Upvotes: 5

akuiper
akuiper

Reputation: 215057

You are testing a String over a a tuple ("you", "to"), which will always be false.

Here is what you want to try:

val testWords = Set("you", "to")
wordsWithStopWords.filter(!testWords.contains(_))

// Simulating the RDD with a List (works the same with RDD)
List("hello", "to", "yes") filter (!testWords.contains(_))
// res30: List[String] = List(hello, yes)

Upvotes: 4

Related Questions