Michal

Reputation: 1895

Proper input for reduce in PySpark

I am trying to discretize some data using Spark.

I have data in the following format:

date           zip   amount
2013/04/02    04324  32.2
2013/04/01    23242  1.5
2013/04/02    99343  12

Then I have the following code:

sampleTable = sqlCtx.inferSchema(columns)
sampleTable.registerAsTable("amounts")

exTable = sampleTable.map(lambda p: {"date": p.date, "zip": p.zip, "amount": p.amount})

Then I have a function to discretize:

def discretize((key, data), cutoff=0.75):
    result = (data < np.percentile(index,cutoff))
    return result

I will take this result column and later join it with the original data set.

I am trying to perform the action using this statement:

exDiscretized = exTable.map(lambda x: (((dt.datetime.strptime(x.date, '%Y/%m/%d')).year,
                                        (dt.datetime.strptime(x.date, '%Y/%m/%d')).month),
                                       x.amount)).reduce(discretize).collect()

Essentially, I would like a tuple of ((year, month), entire row) so that I can find the 75th percentile for each year and month combination.
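For clarity, here is a sketch of the keying step I have in mind (the helper name to_month_key is just illustrative, not part of my real code):

import datetime as dt

# Illustrative helper: turn one row into a ((year, month), amount) pair,
# parsing the date string only once.
def to_month_key(row):
    d = dt.datetime.strptime(row.date, '%Y/%m/%d')
    return ((d.year, d.month), row.amount)

keyed = exTable.map(to_month_key)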

I am able to get the map portion to work fine; if I take out the reduce, the code runs.

When I run the statement with both the map and the reduce, I get the following error:

org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib/spark/python/pyspark/worker.py", line 79, in main
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib/spark/python/pyspark/serializers.py", line 196, in dump_stream
    self.serializer.dump_stream(self._batched(iterator), stream)
  File "/opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib/spark/python/pyspark/serializers.py", line 127, in dump_stream
    for obj in iterator:
  File "/opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib/spark/python/pyspark/serializers.py", line 185, in _batched
    for item in iterator:
  File "/opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib/spark/python/pyspark/rdd.py", line 715, in func
    yield reduce(f, iterator, initial)
  File "<stdin>", line 2, in discretize
  File "/usr/local/lib/python2.7/dist-packages/numpy-1.9.1-py2.7-linux-x86_64.egg/numpy/lib/function_base.py", line 3051, in percentile
    q = array(q, dtype=np.float64, copy=True)
ValueError: setting an array element with a sequence.

I'm not sure what I'm doing wrong. Perhaps it has something to do with the way I am generating a key value pair?

Upvotes: 0

Views: 1126

Answers (1)

Holden

Reputation: 7452

So I think the root of the problem is that reduce doesn't work the way you are trying to use it. Since you want to bring all of the data for a single key together, groupByKey is likely the function you are looking for. Here is an example:

import numpy as np

input = sc.parallelize([("hi", 1), ("bye", 0), ("hi", 3)])
groupedInput = input.groupByKey()

def top(x):
    data = list(x)
    # np.percentile takes q on a 0-100 scale, so 70 means the 70th percentile
    percentile = np.percentile(data, 70)
    return filter(lambda x: x >= percentile, data)

modifiedGroupedInput = groupedInput.mapValues(top)
modifiedGroupedInput.collect()

results in:

[('bye', [0]), ('hi', [3])]

In general reduceByKey is the better choice, but since you want to consider all of the elements for each key at the same time to compute the percentile, groupByKey is what you need here.

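Applied to your data, a rough sketch (using the ((year, month), amount) pairs from your map step, here called keyed, and the 75th percentile; the helper name is illustrative) could look like this:

import numpy as np

def discretize_group(amounts):
    # amounts is the iterable of values for one (year, month) key
    data = list(amounts)
    cutoff = np.percentile(data, 75)        # 75th percentile for this group
    # keep each amount alongside its flag so it can be joined back later
    return [(a, a < cutoff) for a in data]

flagged = keyed.groupByKey().mapValues(discretize_group)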
Upvotes: 1
