Storm

Reputation: 373

Substitute a variable's value in PySpark lambda function

How can I use a variable inside a lambda function?

for a_name in name_field_names:
    results = sqlContext.sql("SELECT * FROM noise_data")
    stringsDS = results.map(lambda p:p.(a_name))

The lambda function expects the name of a column, but I am passing it a variable.

How can I pass the value of the a_name variable to the lambda function?

Upvotes: 0

Views: 899

Answers (1)

zero323

Reputation: 330073

To get a value from a Row by name, use bracket notation:

from pyspark.sql import Row

row = Row(a = "foo", b = "bar")
row["a"]
'foo'

or getattr:

getattr(row, "b")
'bar'
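Both forms also work when the field name is held in a variable, which is exactly the question's situation. A minimal local sketch (the FakeRow class below is a hypothetical stand-in for pyspark.sql.Row, which supports the same name-based access, so no Spark session is needed):

```python
class FakeRow:
    """Stand-in for pyspark.sql.Row -- supports bracket and
    attribute access by field name (local illustration only)."""
    def __init__(self, **fields):
        self.__dict__.update(fields)

    def __getitem__(self, name):
        return self.__dict__[name]


row = FakeRow(a="foo", b="bar")

a_name = "a"
print(row[a_name])           # bracket notation with a variable -> foo
print(getattr(row, a_name))  # getattr with the same variable -> foo
```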

You can also skip map and use select:

sqlContext.sql("SELECT * FROM noise_data").select(a_name)

Also remember that Python has late binding. Using a variable from the enclosing scope inside a function called in a loop is not a good idea. If you want map, you should rather capture a_name as an attribute, for example:

from operator import attrgetter

for a_name in name_field_names:
    results = ...
    results.rdd.map(attrgetter(a_name))
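The late-binding pitfall can be shown in plain Python, without Spark (SimpleNamespace stands in for a row here):

```python
from operator import attrgetter
from types import SimpleNamespace

row = SimpleNamespace(a=1, b=2, c=3)
names = ["a", "b", "c"]

# Pitfall: each lambda closes over the *variable* n, so every one of
# them sees its final value ("c") when eventually called.
late_bound = [lambda r: getattr(r, n) for n in names]
print([f(row) for f in late_bound])  # [3, 3, 3]

# Fix: attrgetter captures the name at creation time.
bound = [attrgetter(n) for n in names]
print([g(row) for g in bound])       # [1, 2, 3]
```

This is why mapping with attrgetter(a_name) inside the loop is safe, while a lambda referring to a_name is not.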

Upvotes: 1
