Reputation: 702
I am attempting to filter the stop words out of an RDD of words read from a .txt
file.
// Creating the RDDs
val input = sc.textFile("../book.txt")
val stopWordsInput = sc.textFile("../stopwords.csv")
val stopWords = stopWordsInput.map(x => x.split(","))
// Create a tuple of test words
val testWords = ("you", "to")
// Split using a regular expression that extracts words
val wordsWithStopWords = input.flatMap(x => x.split("\\W+"))
The code above all makes sense to me and seems to work fine. This is where I'm having trouble.
//Remove the stop words from the list
val words = wordsWithStopWords.filter(x => x != testWords)
This will run but doesn't actually filter out the words contained in the tuple testWords. I'm not sure how to test the words in wordsWithStopWords against every word in my tuple testWords.
Upvotes: 2
Views: 13119
Reputation: 23
Using subtractByKey:
// Creating the RDDs
val input = sc.textFile("../book.txt")
val stopWordsInput = sc.textFile("../stopwords.csv")
// Split using a regular expression that extracts words from input RDD
val wordsWithInput = input.flatMap(x => x.split("\\W+"))
//Converting above RDDs to lowercase
val lowercaseInput = wordsWithInput.map(x => x.toLowerCase())
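// The question splits stopwords.csv on commas, so split it into single words here as well
val lowercaseStopWordsInput = stopWordsInput.flatMap(x => x.split(",")).map(x => x.trim.toLowerCase())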
//Creating a tuple(word, 1) using map for above RDDs
val tupleInput = lowercaseInput.map(x => (x,1))
val tupleStopWordsInput = lowercaseStopWordsInput.map(x => (x,1))
//using subtractByKey
val tupleWords = tupleInput.subtractByKey(tupleStopWordsInput)
//to have only words in RDD
val words = tupleWords.keys
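To see what subtractByKey keeps and drops without starting Spark, here is a minimal sketch on plain Scala collections. The object SubtractByKeySketch and its local subtractByKey helper are hypothetical stand-ins, not Spark's API, and unlike an RDD the local version preserves input order.

```scala
object SubtractByKeySketch {
  // Local stand-in for subtractByKey: keep every pair from `left` whose key
  // does not occur as a key in `right`. Duplicate pairs in `left` are kept.
  def subtractByKey[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)]): Seq[(K, V)] = {
    val rightKeys = right.map(_._1).toSet
    left.filter { case (key, _) => !rightKeys.contains(key) }
  }

  def main(args: Array[String]): Unit = {
    // Made-up sample data, shaped like the (word, 1) pairs above
    val tupleInput = Seq("the", "cat", "and", "the", "cat").map(w => (w, 1))
    val tupleStopWords = Seq("the", "and").map(w => (w, 1))
    val words = subtractByKey(tupleInput, tupleStopWords).map(_._1)
    println(words) // List(cat, cat)
  }
}
```

The value attached to each word is never read, which is why the (word, 1) pairs only exist to give subtractByKey a key to match on. On an RDD of plain strings, subtract should give a similar result without building the pairs.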
Upvotes: 0
Reputation: 40380
You can use a broadcast variable to filter with your stop words RDD:
import org.apache.spark.rdd.RDD
// Creating the RDDs
val input = sc.textFile("../book.txt")
val stopWordsInput = sc.textFile("../stopwords.csv")
// Flatten, collect, and broadcast.
val stopWords = stopWordsInput.flatMap(x => x.split(",")).map(_.trim)
val broadcastStopWords = sc.broadcast(stopWords.collect.toSet)
// Split using a regular expression that extracts words
val wordsWithStopWords: RDD[String] = input.flatMap(x => x.split("\\W+"))
wordsWithStopWords.filter(!broadcastStopWords.value.contains(_))
Broadcast variables allow you to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. They can be used, for example, to give every node a copy of a large input dataset in an efficient manner (in this case as well).
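The parsing and filtering steps can be checked locally before running them on a cluster. This is a rough sketch with plain Scala collections standing in for the RDDs, so sc.broadcast is left out entirely. The object and method names are hypothetical, the sample lines are made up, and the lowercasing and empty-token filter are my assumptions rather than part of the code above.

```scala
object StopWordPipelineSketch {
  // Turn the lines of a comma-separated stop word file into a lookup set.
  def parseStopWords(csvLines: Seq[String]): Set[String] =
    csvLines.flatMap(_.split(",")).map(_.trim.toLowerCase).filter(_.nonEmpty).toSet

  // Split lines into words, then drop empty tokens and stop words.
  // split("\\W+") yields a leading "" when a line starts with a non-word character.
  def removeStopWords(lines: Seq[String], stopWords: Set[String]): Seq[String] =
    lines
      .flatMap(_.split("\\W+"))
      .map(_.toLowerCase)
      .filter(word => word.nonEmpty && !stopWords.contains(word))

  def main(args: Array[String]): Unit = {
    val stopWords = parseStopWords(Seq("you, to ,the", "a"))
    val words = removeStopWords(Seq("You want to read the book.", "  A good book"), stopWords)
    println(words) // List(want, read, book, good, book)
  }
}
```

Collecting the stop words into a Set is what makes broadcasting reasonable here: the list is small, and each lookup takes roughly constant time.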
Upvotes: 5
Reputation: 215057
You are comparing each String against the whole tuple ("you", "to"). A String never equals a tuple, so x != testWords is always true and nothing gets filtered out.
Here is what you want to try:
val testWords = Set("you", "to")
wordsWithStopWords.filter(!testWords.contains(_))
// Simulating the RDD with a List (works the same with RDD)
List("hello", "to", "yes") filter (!testWords.contains(_))
// res30: List[String] = List(hello, yes)
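The difference between the two checks can be seen side by side in a small sketch on a List. The object TupleVersusSet and its two methods are hypothetical names for illustration, and equals is used in place of != only to sidestep compiler complaints about comparing unrelated types.

```scala
object TupleVersusSet {
  // Mirrors the filter from the question: a String never equals a Tuple2,
  // so the predicate is true for every word and nothing is removed.
  def filterWithTuple(words: List[String], testWords: (String, String)): List[String] =
    words.filter(x => !x.equals(testWords))

  // Membership test against a Set: removes exactly the listed words.
  def filterWithSet(words: List[String], testWords: Set[String]): List[String] =
    words.filter(x => !testWords.contains(x))

  def main(args: Array[String]): Unit = {
    val words = List("hello", "to", "yes", "you")
    println(filterWithTuple(words, ("you", "to"))) // List(hello, to, yes, you)
    println(filterWithSet(words, Set("you", "to"))) // List(hello, yes)
  }
}
```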
Upvotes: 4