user3803714

Reputation: 5389

Pyspark: global name is not defined

My data after a left outer join is in the following format:

    # (u'session_id', ((u'prod_id', u'user_id'), (u'prod_label', u'user_id')))

    # (u'session_id', ((u'20133', u'129001032'), None))
    # (u'session_id', ((u'2024574', u'61370212'), (u'Loc1', u'61370212')))

I want data in the following format now: (user_id, prod_id, prod_label)

When I try the following to get that, I get this error:

    result_rdd = rdd1.map(lambda (session_id, (prod_id, user_id), (prod_label, user_id)): user_id, prod_id, prod_label)

    NameError: global name 'prod_id' is not defined

Upvotes: 0

Views: 4265

Answers (1)

zero323

Reputation: 330073

This is simply not valid syntax for a lambda expression. If you want to return a tuple, it has to be written with full parentheses:

    rdd1.map(lambda (session_id, ((prod_id, user_id_1), (prod_label, user_id_2))):
        (user_id_1, prod_id, prod_label))
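
The NameError in the question comes from exactly this parsing issue: without parentheses around the body, the commas after `user_id` are read as extra arguments to `map`, so Python looks up `prod_id` as a global name at call time. A minimal sketch of the difference (Python 2 syntax, as in the question):

    pairs = [("a", 1), ("b", 2)]

    # Without parentheses the commas become extra arguments, so this is
    # parsed as map(<lambda returning fst>, snd, pairs) and Python raises
    # NameError while looking up `snd` as a global name:
    # map(lambda (fst, snd): fst, snd, pairs)

    # With full parentheses the whole tuple is the lambda body:
    print map(lambda (fst, snd): (snd, fst), pairs)
    # [(1, 'a'), (2, 'b')]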

Also keep in mind that tuple parameter unpacking is not portable (it was removed in Python 3, see PEP 3113), and that duplicate parameter names are not allowed and will result in a `SyntaxError`.
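
A portable sketch would use a regular function with explicit unpacking instead; the name `to_flat` is just illustrative, and the `None` handling is an assumption based on the left outer join sample in the question, where unmatched rows carry `None`:

    def to_flat(record):
        # record looks like (session_id, ((prod_id, user_id), matched)),
        # where matched is (prod_label, user_id) or None after the left
        # outer join (an assumption based on the sample data above).
        session_id, (left, right) = record
        prod_id, user_id = left
        prod_label = right[0] if right is not None else None
        return (user_id, prod_id, prod_label)

    result_rdd = rdd1.map(to_flat)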

Upvotes: 2
