Reputation: 13
Here's my code for an example:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.IntegerType
val marketingproj5DF2 = marketingproj5DF.withColumn("ageTmp", 'age.cast(IntegerType)).drop("age").withColumnRenamed("ageTmp","age")
Here's what the DF looks like after:
scala> marketingproj5DF2.show(5)
+--------+----------------+-----------+-------------+-----------+-----------+-----------+--------+-----------+-------+---------+------------+------------+---------+------------+------------+-------+------+
|     age|             job|    marital|    education|    default|    balance|    housing|    loan|    contact|    day|    month|    duration|    campaign|    pdays|    previous|    poutcome|      y|ageTmp|
+--------+----------------+-----------+-------------+-----------+-----------+-----------+--------+-----------+-------+---------+------------+------------+---------+------------+------------+-------+------+
|"""age""|         ""job""|""marital""|""education""|""default""|""balance""|""housing""|""loan""|""contact""|""day""|""month""|""duration""|""campaign""|""pdays""|""previous""|""poutcome""| ""y"""|  null|
|     "58|  ""management""|""married""| ""tertiary""|     ""no""|       2143|    ""yes""|  ""no""|""unknown""|      5|  ""may""|         261|           1|       -1|           0| ""unknown""|""no"""|  null|
|     "44|  ""technician""| ""single""|""secondary""|     ""no""|         29|    ""yes""|  ""no""|""unknown""|      5|  ""may""|         151|           1|       -1|           0| ""unknown""|""no"""|  null|
|     "33|""entrepreneur""|""married""|""secondary""|     ""no""|          2|    ""yes""| ""yes""|""unknown""|      5|  ""may""|          76|           1|       -1|           0| ""unknown""|""no"""|  null|
|     "47| ""blue-collar""|""married""|  ""unknown""|     ""no""|       1506|    ""yes""|  ""no""|""unknown""|      5|  ""may""|          92|           1|       -1|           0| ""unknown""|""no"""|  null|
+--------+----------------+-----------+-------------+-----------+-----------+-----------+--------+-----------+-------+---------+------------+------------+---------+------------+------------+-------+------+
only showing top 5 rows
I'm using Spark 1.6 with Scala 2.10.5. The first column is my original "age" column. The data is imported from a .csv, and I couldn't get all of the data into a DF unless I left it as a string. Now that I have the "age" column in, I'm trying to convert/cast the field and query against it.
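For illustration, once the cast works, the kind of query I want to run against "age" is roughly along these lines (the exact predicate is just an example):
marketingproj5DF2.filter($"age" > 40).show(5)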
Upvotes: 1
Views: 5354
Reputation: 69
import org.apache.spark.sql

// withColumn with an existing column name overwrites that column in place
val marketingproj5DF2 = marketingproj5DF.withColumn("age", $"age".cast(sql.types.IntegerType))
Upvotes: -1
Reputation: 28322
The problem is due to the extra " in the age column. It needs to be removed before casting the column to Int. Also, you do not need to use a temporary column, drop the original, and then rename the temporary column back to the original name; simply use withColumn() to overwrite the original. regexp_replace can solve the extra " problem:
import org.apache.spark.sql.functions.regexp_replace
import org.apache.spark.sql.types.IntegerType
val df = Seq("\"58", "\"44", "\"33", "\"47").toDF("age")
val df2 = df.withColumn("age", regexp_replace($"age", "\"", "").cast(IntegerType))
Which will give the desired result:
+---+
|age|
+---+
| 58|
| 44|
| 33|
| 47|
+---+
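Applied back to the question's DataFrame, the same idea would look roughly like this (an untested sketch; only the age column is touched and the other columns pass through unchanged):
val marketingproj5DF2 = marketingproj5DF.withColumn("age", regexp_replace($"age", "\"", "").cast(IntegerType))
After that, age is an integer column and filters such as $"age" > 40 should behave as expected.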
Upvotes: 2