Reputation: 316
I'm new to Python, and also new to PySpark. I'm trying to run a line of code that takes (kv[0], kv[1]) and then runs an ngrams() function on kv[1].
Here is a sample of the mentions data that the code works on:
Out[12]:
[{'_id': u'en.wikipedia.org/wiki/Kamchatka_Peninsula',
'source': 'en.wikipedia.org/wiki/Warthead_sculpin',
'span': (100, 119),
'text': u' It is native to the northern.'},
{'_id': u'en.wikipedia.org/wiki/Warthead_sculpin',
'source': 'en.wikipedia.org/wiki/Warthead_sculpin',
'span': (4, 20),
'text': u'The warthead sculpin ("Myoxocephalus niger").'}]
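For reference, a minimal sketch of how data with this layout could be turned into a small test RDD, assuming a local SparkContext named sc (how mentions is actually produced in the real pipeline isn't shown here):
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# A tiny RDD with the same layout as the sample above: each element is a dict
mentions = sc.parallelize([
    {'_id': u'en.wikipedia.org/wiki/Kamchatka_Peninsula',
     'source': 'en.wikipedia.org/wiki/Warthead_sculpin',
     'span': (100, 119),
     'text': u' It is native to the northern.'},
    {'_id': u'en.wikipedia.org/wiki/Warthead_sculpin',
     'source': 'en.wikipedia.org/wiki/Warthead_sculpin',
     'span': (4, 20),
     'text': u'The warthead sculpin ("Myoxocephalus niger").'},
])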
This is the code that I'm working with:
def build(self, mentions, idfs):
    m = mentions\
        .map(lambda (source, target, span, text): (target, text))\
        .flatMapValues(lambda v: ngrams(v, self.max_ngram))\
        .map(lambda v: (v, 1))\
        .reduceByKey(add)\
How should the data from the previous step be formatted to resolve this error? Any help or guidance will be truly appreciated.
I'm using python 2.7 and pyspark 2.3.0.
Thank you,
Upvotes: 2
Views: 196
Reputation:
flatMapValues (like mapValues) can be applied only to an RDD of (key, value) pairs, i.e. an RDD where each element is a tuple of length 2, or some object that behaves like one (see How to determine if object is a valid key-value pair in PySpark). Your data is a dictionary, so it doesn't qualify. It is not clear what you expect there, but I suspect you want:
from operator import itemgetter

(mentions
    .map(itemgetter("_id", "text"))
    .flatMapValues(lambda v: ngrams(v, self.max_ngram))
    .map(lambda v: (v, 1)))
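A quick way to see why this works: itemgetter("_id", "text") applied to one of the sample dicts returns a plain length-2 tuple, which is exactly the (key, value) shape that flatMapValues expects. A minimal check in plain Python (no Spark needed), using one of the dicts from the question:
from operator import itemgetter

d = {'_id': u'en.wikipedia.org/wiki/Warthead_sculpin',
     'source': 'en.wikipedia.org/wiki/Warthead_sculpin',
     'span': (4, 20),
     'text': u'The warthead sculpin ("Myoxocephalus niger").'}

# Prints a length-2 tuple:
# (u'en.wikipedia.org/wiki/Warthead_sculpin',
#  u'The warthead sculpin ("Myoxocephalus niger").')
print(itemgetter("_id", "text")(d))
Since every element of the mapped RDD is then a 2-tuple, flatMapValues and the rest of the chain apply cleanly, and you can finish with the .reduceByKey(add) from your original build method to get the n-gram counts.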
Upvotes: 1