Storm

Reputation: 373

Substitute a variable's value in PySpark lambda function

How can I use a variable inside a lambda function?

for a_name in name_field_names:
    results = sqlContext.sql("SELECT * FROM noise_data")
    stringsDS = results.map(lambda p:p.(a_name))

The lambda function expects the name of a column, but I am passing it a variable.

How can I pass the value of the a_name variable to the lambda function?

Upvotes: 0

Views: 899

Answers (1)

zero323

Reputation: 330073

To get a value from a Row by name, use bracket notation:

from pyspark.sql import Row

row = Row(a = "foo", b = "bar")
row["a"]
'foo'

or getattr:

getattr(row, "b")
'bar'
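Both forms also work when the field name is held in a variable, which is exactly the question's situation. A minimal local sketch (the FakeRow class below is a hypothetical stand-in for pyspark.sql.Row, which supports the same name-based access, so no Spark session is needed):

```python
class FakeRow:
    """Stand-in for pyspark.sql.Row -- supports bracket and
    attribute access by field name (local illustration only)."""
    def __init__(self, **fields):
        self.__dict__.update(fields)

    def __getitem__(self, name):
        return self.__dict__[name]


row = FakeRow(a="foo", b="bar")

a_name = "a"
print(row[a_name])           # bracket notation with a variable -> foo
print(getattr(row, a_name))  # getattr with the same variable -> foo
```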

You can also skip map and use select:

sqlContext.sql("SELECT * FROM noise_data").select(a_name)

Also remember that Python has late binding. Using a variable from the enclosing scope inside a function called in a loop is not a good idea. If you want map, you should rather capture a_name as an attribute, for example:

from operator import attrgetter

for a_name in name_field_names:
    results = ...
    results.rdd.map(attrgetter(a_name))
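The late-binding pitfall can be shown in plain Python, without Spark (SimpleNamespace stands in for a row here):

```python
from operator import attrgetter
from types import SimpleNamespace

row = SimpleNamespace(a=1, b=2, c=3)
names = ["a", "b", "c"]

# Pitfall: each lambda closes over the *variable* n, so every one of
# them sees its final value ("c") when eventually called.
late_bound = [lambda r: getattr(r, n) for n in names]
print([f(row) for f in late_bound])  # [3, 3, 3]

# Fix: attrgetter captures the name at creation time.
bound = [attrgetter(n) for n in names]
print([g(row) for g in bound])       # [1, 2, 3]
```

This is why mapping with attrgetter(a_name) inside the loop is safe, while a lambda referring to a_name is not.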

Upvotes: 1
