Reputation: 75
Coming from a SQL background here... I'm using df1 = spark.read.jdbc to load data from Azure SQL into a dataframe. I am trying to filter the data to exclude rows meeting the following criteria:
df2 = df1.filter("ItemID <> '75' AND Code1 <> 'SL'")
The dataframe ends up being empty, but when I run the equivalent SQL it is correct. When I change it to
df2 = df1.filter("ItemID = '75' AND Code1 = 'SL'")
it produces the rows I want to filter out.
What is the best way to remove the rows meeting the criteria, so the remaining data can be pushed to SQL Server? Thank you
Upvotes: 2
Views: 115
Reputation: 6323
In the SQL world, <> checks whether the values of two operands are equal or not; if the values are not equal, the condition becomes true.
Its equivalent in Spark SQL is !=. Thus your SQL condition inside filter becomes:
# A != B -> TRUE if expression A is not equivalent to expression B; otherwise FALSE
df2 = df1.filter("ItemID != '75' AND Code1 != 'SL'")
= has the same meaning in Spark SQL as in ANSI SQL:
df2 = df1.filter("ItemID = '75' AND Code1 = 'SL'")
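If you then want to push df2 to SQL Server, here is a minimal sketch using the DataFrame JDBC writer; the URL, table name, and credentials below are placeholders to replace with your own:
# Hypothetical Azure SQL connection details - replace with your own
jdbc_url = "jdbc:sqlserver://<server>.database.windows.net:1433;databaseName=<db>"
props = {"user": "<user>", "password": "<password>",
         "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver"}

# Append the filtered rows to a target table
df2.write.jdbc(url=jdbc_url, table="dbo.TargetTable", mode="append", properties=props)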
Upvotes: 2
Reputation: 31460
Use the & operator with != in PySpark.
<> was removed in Python 3.
Example:
from pyspark.sql.functions import col

df = spark.createDataFrame([(75, 'SL'), (90, 'SL1')], ['ItemID', 'Code1'])
df.filter((col("ItemID") != '75') & (col("Code1") != 'SL')).show()
# or using negation
df.filter(~(col("ItemID") == '75') & ~(col("Code1") == 'SL')).show()
#+------+-----+
#|ItemID|Code1|
#+------+-----+
#|    90|  SL1|
#+------+-----+
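As a quick sanity check on the same example DataFrame, flipping the condition back to == shows exactly the rows being excluded:
df.filter((col("ItemID") == '75') & (col("Code1") == 'SL')).show()
#+------+-----+
#|ItemID|Code1|
#+------+-----+
#|    75|   SL|
#+------+-----+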
Upvotes: 0