Edamame

Reputation: 25406

pyspark: keep a function in the lambda expression

I have the following working code:

def replaceNone(row):
  myList = []
  row_len = len(row)
  for i in range(0, row_len):
    if row[i] is None:
      myList.append("")
    else:
      myList.append(row[i])
  return myList

rdd_out = rdd_in.map(lambda row: replaceNone(row))

Here each row is a Row (from pyspark.sql import Row).
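For example, replaceNone turns a row like the following into a plain list (the field names here are just for illustration):

from pyspark.sql import Row

row = Row(a=1, b=None)
print(replaceNone(row))  # [1, '']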

However, it is kind of lengthy and ugly. Is it possible to avoid writing the separate replaceNone function by putting everything directly in the lambda expression? Or at least to simplify replaceNone()? Thanks!

Upvotes: 0

Views: 2812

Answers (1)

Erik

Reputation: 132

I'm not sure what your goal is. It seems like you're just trying to replace all the None values in each row of rdd_in with empty strings, in which case you can use a list comprehension:

rdd_out = rdd_in.map(lambda row: [r if r is not None else "" for r in row])

The call to map applies the lambda to every row in rdd_in, and the list comprehension builds a new list for each row with every None replaced by an empty string.
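As an end-to-end sketch on an actual RDD (the SparkContext setup and the sample rows are illustrative assumptions, not from the original question):

from pyspark import SparkContext
from pyspark.sql import Row

sc = SparkContext.getOrCreate()

# Hypothetical stand-in for rdd_in: Row objects with some None fields
rdd_in = sc.parallelize([Row(a=1, b=None), Row(a=None, b=2)])

rdd_out = rdd_in.map(lambda row: [r if r is not None else "" for r in row])
print(rdd_out.collect())  # [[1, ''], ['', 2]]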

This also works on a trivial pure-Python example. Since a plain list has no map method, the snippet below defines a small map helper to stand in for RDD.map:

def map(l, f):
    # Stand-in for RDD.map; note that this shadows Python's built-in map.
    return [f(r) for r in l]

l = [[1, None, 2], [3, 4, None], [None, 5, 6]]
l2 = map(l, lambda row: [i if i is not None else "" for i in row])

print(l2)
# [[1, '', 2], [3, 4, ''], ['', 5, 6]]
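Note that the same comprehension also works when the rows are pyspark.sql Row objects, since Row is a subclass of tuple and iterating over it yields the field values; the result is a plain list rather than a Row, though. A quick sketch (the field names are illustrative):

from pyspark.sql import Row

row = Row(a=1, b=None, c=3)
cleaned = [r if r is not None else "" for r in row]
print(cleaned)  # [1, '', 3]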

Upvotes: 1
