Reputation: 5924
Using Spark 2.1 (on Databricks), I have a table which has a column of type String as a result of an import from a .CSV file. In a SELECT query against that table, I am attempting to convert that column's value into an Integer before using the column value in a mathematical operation. I have been unable to find the right Spark SQL "function" to do this.
Below is an example of the SQL. "TO_NUMBER" isn't working on either of the two Strings; Sum_GN_POP or Count1:
SELECT name AS geohashPrefix3, TO_NUMBER(Sum_GN_POP) AS totalPopulation, TO_NUMBER(Count1) AS landMass
FROM wayne_geohash3
WHERE (LENGTH(name) = 3)
And it would be helpful if I could find the documentation for this. I will want to do other kinds of conversions (or casts) with other types, too. Any guidance on either or both of this is greatly appreciated.
Upvotes: 11
Views: 119011
Reputation: 41
Haroun's answer about casting in Sql works for me. But notice that, if the number in the string is bigger than integer
, result will be null
. For numbers bigger than integer
(long
or bigint
), the cast should be like:
CAST(Sum_GN_POP as BIGINT)
Upvotes: 4
Reputation: 27373
I would to it using an UDF because Spark's cast will not capture variable overflow:
val parseInt = udf((s:String) => scala.util.Try{Some(s.toInt)}.getOrElse(None))
Seq("100", "10000000000", "1x0")
.toDF("i")
.select(
$"i" cast "int" as "casted_result",
parseInt($"i") as "udf_result"
).show
+-------------+----------+
|casted_result|udf_result|
+-------------+----------+
| 100| 100|
| 1410065408| null|
| null| null|
+-------------+----------+
Upvotes: 3
Reputation: 2424
You can get it as Integer
from the csv
file using the option inferSchema like this :
val df = spark.read.option("inferSchema", true).csv("file-location")
That being said : the inferSchema option do make mistakes sometimes and put the type as String
. if so you can use the cast
operator on Column
Dataframe/Dataset Implemetation :
val df2 = df.withColumn("Count1", $"Count1" cast "Int" as "landMass").withColumn("Count1", $"Sum_GN_POP" cast "Int" as "totalPopulation")
SQL Implemetation :
SELECT name AS geohashPrefix3, CAST(Sum_GN_POP as INT) AS totalPopulation, CAST(Count1 AS INT) AS landMass
FROM wayne_geohash3
WHERE (LENGTH(name) = 3)
Upvotes: 17
Reputation: 5924
Summary:
Apache Spark's SQL has partial compatibility with Apache Hive. So, most SQL that can be written in Hive can be written in Spark SQL.
Detail:
To convert a STRING to a specific numeric type like INT, a cast may be used. The cast consists of wrapping the target with parenthesis and preceding the parenthesis with the type to which it is to be changed. For example, the cast might look like this:
INT(someStringValue)
So, to make the SQL in the original posted question work, it needs to be changed to look like this (replacing the original function named "TO_NUMBER" with "INT"):
SELECT name AS geohashPrefix3, INT(Sum_GN_POP) AS totalPopulation, INT(Count1) AS landMass
FROM wayne_geohash3
WHERE (LENGTH(name) = 3)
Upvotes: 7