Reputation: 611
I am not sure if this is possible in PySpark. I believe it should be, but I am not winning here :(.
Requirement: return any records whose FNAME and LNAME are null or 0.
Expected result: the top two rows.
df = sqlContext.read.format('com.databricks.spark.csv').options(header='true').load(fileName)
df.show()
+------+-------+------+
| FNAME| LNAME| CITY|
+------+-------+------+
| 0| null| NY|
| null| 0| null|
| Joe| null| LA|
| null| Deon| SA|
| Steve| Mark| null|
+------+-------+------+
colCondition = []
for col in df.columns:
    # Build one "(column is null or 0)" clause per column, as a string
    condition = '(df.' + col + '.isNull() | df.' + col + ' == 0)'
    colCondition.append(condition)
dfWhereCondition = ' & '.join(colCondition)
This is what I want to achieve:
df.where(dfWhereCondition)
This does not work because dfWhereCondition is treated as a plain string inside the where condition. How can I solve this issue, or is there a better way of achieving this?
Thanks
Upvotes: 3
Views: 5329
Reputation: 10086
If you want to use a string condition, you can use a SQL filter clause:
condition = ' AND '.join(['('+ col + ' IS NULL OR ' + col + ' = 0)' for col in df.columns])
df.filter(condition)
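For a runnable comparison, here is a minimal sketch assuming a SparkSession named spark and a DataFrame recreated from the sample in the question (columns FNAME, LNAME, CITY). The first variant uses the SQL string filter from this answer; the second builds the same predicate as Column expressions with functools.reduce, which avoids string handling inside where/filter entirely.
from functools import reduce
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# Recreate the sample data from the question (None is rendered as null by show()).
df = spark.createDataFrame(
    [("0", None, "NY"),
     (None, "0", None),
     ("Joe", None, "LA"),
     (None, "Deon", "SA"),
     ("Steve", "Mark", None)],
    ["FNAME", "LNAME", "CITY"])

# 1) SQL string condition (this answer's approach).
str_condition = ' AND '.join(['(' + c + ' IS NULL OR ' + c + ' = 0)' for c in df.columns])
df.filter(str_condition).show()

# 2) The same predicate built as Column expressions, combined with reduce.
clauses = [col(c).isNull() | (col(c) == 0) for c in df.columns]
df.where(reduce(lambda a, b: a & b, clauses)).show()
If the requirement really only concerns FNAME and LNAME (so the top two sample rows come back), swap df.columns for ['FNAME', 'LNAME'] in either variant.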
Upvotes: 2