ceteris_paribus

Reputation: 49

spark sql cast function creates column with NULLS

I have the following dataframe and schema in Spark

val df = spark.read.options(Map("header" -> "true")).csv("path")

scala> df.show()

+-------+-------+-----+
|   user|  topic| hits|
+-------+-------+-----+
|     om|  scala|  120|
| daniel|  spark|   80|
|3754978|  spark|    1|
+-------+-------+-----+

scala> df.printSchema

root
 |-- user: string (nullable = true)
 |--  topic: string (nullable = true)
 |--  hits: string (nullable = true)

I want to change the column hits to integer

I tried this:

scala> df.createOrReplaceTempView("test")
scala> val dfNew = spark.sql("select *, cast('hist' as integer) as hist2 from test")

scala> dfNew.printSchema

root
 |-- user: string (nullable = true)
 |--  topic: string (nullable = true)
 |--  hits: string (nullable = true)
 |-- hist2: integer (nullable = true)

but when I print the dataframe, the column hist2 is filled with NULLs:

scala> dfNew.show()

+-------+-------+-----+-----+
|   user|  topic| hits|hist2|
+-------+-------+-----+-----+
|     om|  scala|  120| null|
| daniel|  spark|   80| null|
|3754978|  spark|    1| null|
+-------+-------+-----+-----+

I also tried this:

scala> val df2 = df.withColumn("hitsTmp", df.hits.cast(IntegerType)).drop("hits").withColumnRenamed("hitsTmp", "hits")

and got this:

<console>:26: error: value hits is not a member of org.apache.spark.sql.DataFrame

Also tried this:

scala> val df2 = df.selectExpr ("user","topic","cast(hits as int) hits")

and got this:
org.apache.spark.sql.AnalysisException: cannot resolve '`topic`' given input columns: [user,  topic,  hits]; line 1 pos 0;
'Project [user#0, 'topic, cast('hits as int) AS hits#22]
+- Relation[user#0, topic#1, hits#2] csv

with

 scala> val df2 = df.selectExpr ("cast(hits as int) hits") 

I get a similar error.

Any help will be appreciated. I know this question has been asked before, but I tried three different approaches (shown here) and none of them works.

Thanks.

Upvotes: 1

Views: 26714

Answers (6)

Zach

Reputation: 958

How do we make the Spark cast throw an exception instead of generating all the null values? Do I have to compare the number of null values before and after the cast to see whether the cast actually succeeded?

The post How to test datatype conversion during casting does exactly that. I wonder if there is a better solution.
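
Not from the original thread, but a hedged sketch of two options: on Spark 3.x, enabling ANSI mode makes an invalid cast fail loudly instead of producing null, and in any version you can count the nulls the cast introduced. The column names hits/hits2 below are assumptions taken from the question.

// Hedged sketch, assuming Spark 3.x: in ANSI mode an invalid CAST throws
// at runtime instead of silently returning null.
spark.conf.set("spark.sql.ansi.enabled", "true")
spark.sql("select cast('3A' as integer)").show()   // fails with a cast error

// Without ANSI mode, compare null counts before and after the cast:
import org.apache.spark.sql.functions.col
val casted  = df.withColumn("hits2", col("hits").cast("integer"))
val badRows = casted.filter(col("hits").isNotNull && col("hits2").isNull).count()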

Upvotes: 3

Ambitious Intern

Reputation: 341

I had a similar problem where I was casting strings to integers, but I realized I needed to cast them to longs instead. This was hard to spot at first, because the column's type showed up as int when I printed it with

print(df.dtypes)
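
A hedged sketch of the overflow case this answer describes: a value one past Int.MaxValue comes back null when cast to int but survives as long (assuming the default, non-ANSI cast behavior).

// Overflow sketch: 2147483648 is one past Int.MaxValue (2147483647).
import spark.implicits._   // already in scope in spark-shell
import org.apache.spark.sql.functions.col
val big = Seq("2147483648").toDF("n")
big.select(col("n").cast("int").as("as_int"), col("n").cast("long").as("as_long")).show()
// as_int  -> null (out of range)
// as_long -> 2147483648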

Upvotes: 1

Ban Sun

Reputation: 71

Try removing the quotes around hist. If that does not work, try trimming the column:

val dfNew = spark.sql("select *, cast(trim(hist) as integer) as hist2 from test")
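
The same idea in DataFrame API form, as a hedged sketch; it assumes the column really is named hits (the question's header suggests the name may carry a leading space).

// Trim the value before casting so stray whitespace cannot break the parse.
import org.apache.spark.sql.functions.{col, trim}
val withInts = df.withColumn("hits2", trim(col("hits")).cast("integer"))
withInts.printSchema()   // hits2: integer (nullable = true)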

Upvotes: 1

André Machado

Reputation: 724

I know this answer probably won't be useful for the OP, since it comes with a roughly two-year delay. It might, however, be helpful for someone facing this problem.

Just like you, I had a dataframe with a column of strings which I was trying to cast to integer:

> df.show
+-------+
|     id|
+-------+
|4918088|
|4918111|
|4918154|
   ...

> df.printSchema
root
 |-- id: string (nullable = true)

But after doing the cast to IntegerType the only thing I obtained, just as you did, was a column of nulls:

> df.withColumn("test", $"id".cast(IntegerType))
    .select("id","test")
    .show
+-------+----+
|     id|test|
+-------+----+
|4918088|null|
|4918111|null|
|4918154|null|
      ...

By default, if you try to cast a string that contains non-numeric characters to integer, the cast won't fail; those values are simply set to null, as you can see in the following example:

> val testDf = sc.parallelize(Seq(("1"), ("2"), ("3A") )).toDF("n_str")
> testDf.withColumn("n_int", $"n_str".cast(IntegerType))
        .select("n_str","n_int")
        .show
+-----+-----+
|n_str|n_int|
+-----+-----+
|    1|    1|
|    2|    2|
|   3A| null|
+-----+-----+

The thing with our initial dataframe is that, at first sight, when we use the show method, we can't see any non-numeric characters. However, if you fetch a raw row from your dataframe, you'll see something different:

> df.first
org.apache.spark.sql.Row = [4?9?1?8?0?8?8??]

Why is this happening? You are probably reading a CSV file with an unsupported encoding.

You can solve this by changing the encoding of the file you are reading. If that is not an option, you can also clean each column before doing the type cast. An example:

> val df_cast = df.withColumn("test", regexp_replace($"id", "[^0-9]","").cast(IntegerType))
                  .select("id","test")
> df_cast.show
+-------+-------+
|     id|   test|
+-------+-------+
|4918088|4918088|
|4918111|4918111|
|4918154|4918154|
       ...

> df_cast.printSchema
root
 |-- id: string (nullable = true)
 |-- test: integer (nullable = true)
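
For the encoding route mentioned above, Spark's CSV reader accepts an encoding option (alias charset). A minimal sketch; "UTF-16" here is an assumption, so match it to whatever your file actually uses:

val df = spark.read
  .option("header", "true")
  .option("encoding", "UTF-16")   // assumption: adjust to the file's real encoding
  .csv("path")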

Upvotes: 2

Sahil Sahay

Reputation: 91

This response is late, but I was facing the same issue and this worked, so I thought I'd put it here; it might help someone. Try declaring the schema explicitly as a StructType (see the sketch below). Reading from CSV files and providing an inferred schema via a case class gave weird data-type errors even though all the data formats were properly specified.
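
A minimal sketch of that idea, with column names assumed from the question (adjust to your file). Note that under the default PERMISSIVE read mode, values that don't parse still become null rather than failing the read.

import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val schema = StructType(Seq(
  StructField("user",  StringType,  nullable = true),
  StructField("topic", StringType,  nullable = true),
  StructField("hits",  IntegerType, nullable = true)
))

val df = spark.read
  .option("header", "true")
  .schema(schema)
  .csv("path")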

Upvotes: 0

koiralo

Reputation: 23109

You can cast a column to integer type in the following ways:

df.withColumn("hits", df("hits").cast("integer"))

Or

data.withColumn("hitsTmp",
      data("hits").cast(IntegerType)).drop("hits").
      withColumnRenamed("hitsTmp", "hits")

Or

data.selectExpr ("user","topic","cast(hits as int) hits")
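
One caveat, grounded in the question itself: all three forms assume the column is literally named hits, while the question's printSchema output shows leading spaces in the names (" topic", " hits"), which is also what the cannot resolve error hints at. A hedged sketch of a cleanup that makes the casts above resolve:

// Trim every column name (the CSV header appears to contain leading
// spaces), then the usual cast works.
val cleaned = df.columns.foldLeft(df)((d, c) => d.withColumnRenamed(c, c.trim))
val casted  = cleaned.withColumn("hits", cleaned("hits").cast("integer"))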

Upvotes: 2
