Reputation: 8172
When I try
tokens = cleaned_book(flatMap(normalize_tokenize))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
NameError: name 'flatMap' is not defined
where
cleaned_book.count()
65744
and
def normalize_tokenize(line):
... return re.sub(r'\s+', ' ', line).strip().lower().split(' ')
On the other hand
sc.parallelize([3,4,5]).flatMap(lambda x: range(1,x)).collect()
works fine in the same PySpark shell:
[1, 2, 1, 2, 3, 1, 2, 3, 4]
Why do I get a NameError?
Upvotes: 0
Views: 409
Reputation: 18023
OK, here is a Scala example with a tokenizer that leads me to think you are looking at it the wrong way.
import org.apache.spark.rdd.RDD

def tokenize(f: RDD[String]) = {
  f.map(_.split(" "))
}

val dfsFilename = "/FileStore/tables/some.txt"
val readFileRDD = spark.sparkContext.textFile(dfsFilename)
val wcounts = tokenize(readFileRDD).flatMap(x => x).map(word => (word, 1)).reduceByKey(_ + _)
wcounts.collect()
This works fine, but you need the functional aspect: flatMap is a method on the RDD, so it has to be invoked with dot syntax, i.e. rdd.flatMap(...), and in this sequence. Written as a bare name, as in cleaned_book(flatMap(normalize_tokenize)), Python looks up flatMap as a free-standing function, which is undefined, hence the NameError. I find the inline approach easier, but note that the .flatMap call is what matters.
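Translated back to PySpark, a minimal sketch of the fix, assuming cleaned_book is the RDD of lines from the question:

import re

def normalize_tokenize(line):
    # collapse runs of whitespace, lowercase, and split into words
    return re.sub(r'\s+', ' ', line).strip().lower().split(' ')

# flatMap is a method on the RDD, called with dot syntax
tokens = cleaned_book.flatMap(normalize_tokenize)
tokens.take(5)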
Upvotes: 1