JohnWallz
JohnWallz

Reputation: 13

Cast from String to Int in DF results in null instead of numbers

Here's my code for an example:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.IntegerType
val marketingproj5DF2 = marketingproj5DF.withColumn("ageTmp", 'age.cast(IntegerType)).drop("age").withColumnRenamed("ageTmp","age")

Here's what the DF looks like after:

scala> marketingproj5DF2.show(5)

+--------+----------------+-----------+-------------+-----------+-----------+-----------+--------+-----------+-------+---------+------------+------------+---------
+------------+------------+-------+------+
|     age|             job|    marital|    education|    default|    balance|    housing|    loan|    contact|    day|    month|    duration|    campaign|    pdays
|    previous|    poutcome|      y|ageTmp|
+--------+----------------+-----------+-------------+-----------+-----------+-----------+--------+-----------+-------+---------+------------+------------+---------
+------------+------------+-------+------+
|"""age""|         ""job""|""marital""|""education""|""default""|""balance""|""housing""|""loan""|""contact""|""day""|""month""|""duration""|""campaign""|""pdays""
|""previous""|""poutcome""| ""y"""|  null|
|     "58|  ""management""|""married""| ""tertiary""|     ""no""|       2143|    ""yes""|  ""no""|""unknown""|      5|  ""may""|         261|           1|       -1
|           0| ""unknown""|""no"""|  null|
|     "44|  ""technician""| ""single""|""secondary""|     ""no""|         29|    ""yes""|  ""no""|""unknown""|      5|  ""may""|         151|           1|       -1
|           0| ""unknown""|""no"""|  null|
|     "33|""entrepreneur""|""married""|""secondary""|     ""no""|          2|    ""yes""| ""yes""|""unknown""|      5|  ""may""|          76|           1|       -1
|           0| ""unknown""|""no"""|  null|
|     "47| ""blue-collar""|""married""|  ""unknown""|     ""no""|       1506|    ""yes""|  ""no""|""unknown""|      5|  ""may""|          92|           1|       -1
|           0| ""unknown""|""no"""|  null|
+--------+----------------+-----------+-------------+-----------+-----------+-----------+--------+-----------+-------+---------+------------+------------+---------
+------------+------------+-------+------+
only showing top 5 rows

I'm using Spark 1.6 Scala 2.10.5. The first column is my original "age" column and the data is imported from .csv and I couldn't get all of the data into a DF unless I left it as a string, now that I have the "age" column in, I'm trying to convert/cast the field and query against it.

Upvotes: 1

Views: 5354

Answers (2)

Siddharth Arekar
Siddharth Arekar

Reputation: 69

import org.apache.spark.sql

val marketingproj5DF2 = marketingproj5DF.withColumn("age", $"age".cast(sql.types.IntegerType))

Upvotes: -1

Shaido
Shaido

Reputation: 28322

The problem is due to the extra " in the age column. It needs to be removed before casting the column to Int. Also, you do not need to use a temporary column, dropping the original and then renaming the temporary column to the original name. Simply use withColumn() to overwrite the original.

regexp_replace can solve the extra " problem:

val df = Seq("\"58","\"44","\"33","\"47").toDF("age")
val df2 = df.withColumn("age", regexp_replace($"age", "\"", "").cast(IntegerType))

Which will give the desired result:

+---+
|age|
+---+
| 58|
| 44|
| 33|
| 47|
+---+

Upvotes: 2

Related Questions