just10minutes

Reputation: 611

PySpark dataframe - How to pass string variable to df.where() condition

I am not sure if this is possible in PySpark. I believe it should be, but I am not winning here :(.

Requirement: Return any records whose FNAME and LNAME are null or 0.

Expected result: the top two rows.

df = sqlContext.read.format('com.databricks.spark.csv').options(header='true').load(fileName)
df.show()

+------+-------+------+
| FNAME|  LNAME|  CITY|
+------+-------+------+
|     0|   null|    NY|
|  null|      0|  null|
|   Joe|   null|    LA|
|  null|   Deon|    SA|
| Steve|   Mark|  null|
+------+-------+------+

colCondition = []
for col in df.columns:
    # build a textual condition for each column
    condition = '(df.' + col + '.isNull() | df.' + col + ' == 0)'
    colCondition.append(condition)

# join the per-column conditions with '&'
dfWhereCondition = ' & '.join(colCondition)
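For the three sample columns this joins to the literal string:

(df.FNAME.isNull() | df.FNAME == 0) & (df.LNAME.isNull() | df.LNAME == 0) & (df.CITY.isNull() | df.CITY == 0)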

This is what I want to achieve:

df.where(dfWhereCondition)

This does not work because dfWhereCondition is treated as a plain string inside the where condition. How can I solve this issue, or is there a better way of achieving this?

Thanks

Upvotes: 3

Views: 5329

Answers (1)

MaFF

Reputation: 10086

If you want to use a string condition, you can use a SQL filter clause:

condition = ' AND '.join(['('+ col + ' IS NULL OR ' + col + ' = 0)' for col in df.columns])
df.filter(condition)
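For the sample frame above, this builds the string (FNAME IS NULL OR FNAME = 0) AND (LNAME IS NULL OR LNAME = 0) AND (CITY IS NULL OR CITY = 0), which filter parses as a SQL expression. If you would rather stay with Column objects than SQL strings, roughly the same filter can be built programmatically; a minimal sketch, which loops over every column like the original (swap in ['FNAME', 'LNAME'] for df.columns if only those two should be checked):

from functools import reduce
from pyspark.sql.functions import col

# one Column per field: true when the value is NULL or 0
conditions = [col(c).isNull() | (col(c) == 0) for c in df.columns]

# AND the per-column conditions together and filter on the result
df.filter(reduce(lambda a, b: a & b, conditions)).show()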

Upvotes: 2
