abhinavkulkarni

Reputation: 2399

Inconsistent results due to Spark's lazy evaluation

I have some simple PySpark code:

l = [
    {'userId': 'u1', 'itemId': 'a1', 'click': 1},
    {'userId': 'u1', 'itemId': 'a2', 'click': 0},
    {'userId': 'u2', 'itemId': 'b1', 'click': 1},
    {'userId': 'u2', 'itemId': 'b2', 'click': 1},
]

d = sc.parallelize(l)

Essentially, the first user clicked on one of the two items, while the second user clicked on both items.

Let's group the events by userId and process each group in a function.

def fun((user_id, events)):
    events = list(events)
    user_id = events[0]['userId']

    clicked = set()
    not_clicked = set()

    for event in events:
        item_id = event['itemId']
        if event['click']==1:
            clicked.add(item_id)
        else:
            not_clicked.add(item_id)

    ret = {'userId': user_id, 'click': 1}
    for item_id in clicked:
        ret['itemId'] = item_id
        yield ret

    ret['click'] = 0
    for item_id in not_clicked:
        ret['itemId'] = item_id
        yield ret

d1 = d\
    .map(lambda obj: (obj['userId'], obj))\
    .groupByKey()\
    .flatMap(fun)

d1.collect()

This is what I get:

[{'click': 1, 'itemId': 'a1', 'userId': 'u1'},
 {'click': 0, 'itemId': 'a2', 'userId': 'u1'},
 {'click': 1, 'itemId': 'b1', 'userId': 'u2'},
 {'click': 0, 'itemId': 'b2', 'userId': 'u2'}]

The result for user u2 is incorrect.

Can someone explain why this is happening and what is the best practice to prevent this?

Thanks.

Upvotes: 1

Views: 582

Answers (1)

zero323

Reputation: 330093

What you see has very little to do with Spark's evaluation model. Your code is simply faulty. It is pretty easy to see this when you execute it locally:

key = 'u2'

values = [
    {'click': 1, 'itemId': 'b1', 'userId': 'u2'},
    {'click': 1, 'itemId': 'b2', 'userId': 'u2'}
]

list(fun((key, values)))
[{'click': 0, 'itemId': 'b2', 'userId': 'u2'},
 {'click': 0, 'itemId': 'b2', 'userId': 'u2'}]

As you can see, this makes even less sense than what you get from Spark. The problem is that you use mutable data where it shouldn't be used. Since you modify the same dict in place, all yields return exactly the same object:

(d1, d2) = list(fun((key, values)))
d1 is d2
True
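
You can reproduce the effect with a few lines of plain Python, without Spark. This is just an illustrative sketch of a generator that yields the same mutated dict twice:

def gen():
    ret = {'n': 1}
    yield ret      # yields a reference to ret, not a copy
    ret['n'] = 2   # mutates the object that was already yielded
    yield ret

list(gen())
[{'n': 2}, {'n': 2}]

Both entries are the same object, so the mutation made after the first yield is visible in both.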

I believe the discrepancy compared to Spark is related to batched serialization: the first item is serialized in a different batch before the function exits, so the effective order is more or less like this:

import pickle
from itertools import islice, chain 

gen = fun((key, values))

# The first batch is serialized
b1 = [pickle.dumps(x) for x in list(islice(gen, 0, 1))]

# Window is adjusted and the second batch is serialized
# fun exits with StopIteration when we try to take the second
# element in the batch, so the code proceeds to ret['click'] = 0
b2 = [
    pickle.dumps(x) for x in
    # Use list to eagerly take a whole batch before pickling
    list(islice(gen, 0, 2))  
] 

[pickle.loads(x) for x in chain(*[b1, b2])]
[{'click': 1, 'itemId': 'b1', 'userId': 'u2'},
 {'click': 0, 'itemId': 'b2', 'userId': 'u2'}]

but if you want definitive confirmation you'll have to check it yourself (replace the batched serializer with one which waits for all data).
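
For example (just a sketch, assuming a PySpark version where SparkContext still accepts the batchSize argument), an unlimited batch size makes the serializer materialize the whole partition before pickling:

from pyspark import SparkContext

# Assumption: batchSize=-1 requests an unlimited batch, i.e. the whole
# partition is collected into a list and pickled in one go
sc = SparkContext("local[2]", "batch-check", batchSize=-1)

Re-running the pipeline with such a context should reproduce the local behaviour shown above, i.e. each user's records collapse to copies of the last yielded state.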

How to solve it? Just don't reuse the same dictionary. Instead, initialize a new one inside the loop:

for item_id in clicked:
    yield {'userId': user_id, 'click': 1, 'itemId': item_id}

for item_id in not_clicked:
    yield {'userId': user_id, 'click': 0, 'itemId': item_id}
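
For completeness, here is a sketch of the whole corrected function (written so that it also runs on Python 3, which no longer allows tuple parameters in the signature):

def fun(pair):
    user_id, events = pair

    clicked = set()
    not_clicked = set()

    for event in events:
        if event['click'] == 1:
            clicked.add(event['itemId'])
        else:
            not_clicked.add(event['itemId'])

    # A fresh dict per yield, so every record is an independent object
    for item_id in clicked:
        yield {'userId': user_id, 'click': 1, 'itemId': item_id}

    for item_id in not_clicked:
        yield {'userId': user_id, 'click': 0, 'itemId': item_id}

With this version, d1.collect() should report click = 1 for both of u2's items.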

Upvotes: 2
