Reputation: 113
When I try to compare two columns in a Data-frame using comparison operator and not able to get the accurate results so can someone guide me where I am doing wrong
>>>spark.sql("select High,Low from cc where cc.High < cc.Low").show(5)
+--------+-------+
|High |Low |
+--------+-------+
| 14| 4|
| 19| 3|
| 24| 5|
| 46| 6|
| 106| 8|
+--------+-------+
only showing top 5 rows
So the High and Low columns are string datatype.
The value of 14 is greater than value of 4. so the resulted output is wrong so where I am doing wrong???
Upvotes: 1
Views: 3957
Reputation: 43534
the High and Low columns are string datatype.
The comparison is happening lexicographically. In python you can see this is the case via some simple test cases:
print('14' < '4')
#True
print('44' < '4')
#False
In the first case, the comparison is '1' < '4'
, which return True
.
In order to do numeric comparisons, you can convert to integers:
spark.sql(
"select High,Low from cc where CAST(High AS INTEGER) < CAST(Low AS INTEGER)"
).show()
Notes on lexicographical comparison from some Java documentation:
This is the definition of lexicographic ordering. If two strings are different, then either they have different characters at some index that is a valid index for both strings, or their lengths are different, or both. If they have different characters at one or more index positions, let k be the smallest such index; then the string whose character at position k has the smaller value, as determined by using the < operator, lexicographically precedes the other string. In this case,
compareTo
returns the difference of the two character values at position k in the two string -- that is, the value:this.charAt(k)-anotherString.charAt(k)
If there is no index position at which they differ, then the shorter string lexicographically precedes the longer string. In this case, compareTo returns the difference of the lengths of the strings -- that is, the value:
this.length()-anotherString.length()
Upvotes: 2