algorythms

Reputation: 1585

How to use list comprehension variable names in PySpark dataframes

I am trying to build a list comprehension that has an iteration built into it. However, I have not been able to get this to work. What am I doing wrong?

Here is a trivial representation of what I am trying to do.

dataframe columns = ["code_number_1", "code_number_2", "code_number_3", "code_number_4", "code_number_5", "code_number_6", "code_number_7", "code_number_8", ...]

cols = [0,3,4]
result = df.select([code_number_{f"{x}" for x in cols])

Addendum:

my ultimate goal is to do something like this:

col_buckets = ["code_1", "code_2", "code_3"]
amt_buckets = ["code_1_amt", "code_2_amt", "code_3_amt" ] 

result = df.withColumn("max_amt_{col_index}", max(df.select(max(**amt_buckets**) for col_indices of amt_buckets if ***any of col indices of col_buckets*** =='01')))

Upvotes: 4

Views: 10281

Answers (1)

notNull

Reputation: 31490

[code_number_{f"{x}" for x in cols] is not valid list comprehension syntax.

Instead, try ["code_number_"+str(x) for x in cols], which generates the list of column names ['code_number_0', 'code_number_3', 'code_number_4'].

.select accepts either strings or Column objects as arguments and selects the matching fields from the dataframe.

Example:

from pyspark.sql.functions import col

df = spark.createDataFrame([("a","b","c","d","e")], ["code_number_0","code_number_1","code_number_2","code_number_3","code_number_4"])
cols = [0,3,4]

#passing strings to select
result = df.select(["code_number_"+str(x) for x in cols])

#or passing columns to select
result = df.select([col("code_number_"+str(x)) for x in cols])

result.show()
#+-------------+-------------+-------------+
#|code_number_0|code_number_3|code_number_4|
#+-------------+-------------+-------------+
#|            a|            d|            e|
#+-------------+-------------+-------------+
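
For the addendum in the question, here is a minimal sketch of one possible approach, assuming df actually has the code_1..code_3 and code_1_amt..code_3_amt columns and that the goal is the row-wise maximum of the amount columns whose paired code column equals '01' (the output column name max_amt is illustrative):

from pyspark.sql import functions as F

col_buckets = ["code_1", "code_2", "code_3"]
amt_buckets = ["code_1_amt", "code_2_amt", "code_3_amt"]

#keep each amount only when its paired code column is '01', then take the
#row-wise greatest of those conditional amounts. F.when without .otherwise
#yields null when the condition fails, and F.greatest skips nulls unless
#every input is null.
result = df.withColumn(
    "max_amt",
    F.greatest(*[F.when(F.col(c) == "01", F.col(a))
                 for c, a in zip(col_buckets, amt_buckets)])
)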

Upvotes: 4
