Jakub Mitura

Reputation: 177

Cannot cast values in Spark Scala DataFrame

I am trying to parse string columns into numbers.

Environment: Databricks, Scala 2.12, Spark 3.1

I have columns that were incorrectly parsed as strings; the reason is that the numbers were written sometimes with a comma and sometimes with a dot as the decimal separator.
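For example, parsing a comma-decimal string directly fails, which is why Spark's schema inference falls back to string for these columns:

"15,2".toFloat                    // throws java.lang.NumberFormatException
"15,2".replace(",", ".").toFloat  // 15.2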

My plan is to first replace all commas with dots, parse the values as floats, build a schema with floating-point column types, and recreate the DataFrame, but it does not work.

import org.apache.spark.sql._
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType, FloatType}
import org.apache.spark.sql.{Row, SparkSession}
import sqlContext.implicits._

// temp is a DataFrame with the data included below
val jj = temp.collect().map(row => Row(row.toSeq.map(it => if (it == null) null else it.asInstanceOf[String].replace(",", ".").toFloat)))
val schemaa = temp.columns.map(colN => StructField(colN, FloatType, true))
val newDatFrame = spark.createDataFrame(jj, schemaa)


CSV

Podana aktywność,CRP(6 mcy),WBC(6 mcy),SUV (max) w miejscu zapalenia,SUV (max) tła,tumor to background ratio
218,72,"15,2",16,"1,8","8,888888889"
"199,7",200,"16,5","21,5","1,4","15,35714286"
270,42,"11,17","7,6","2,4","3,166666667"
200,226,"29,6",9,"2,8","3,214285714"
200,45,"13,85",17,"2,1","8,095238095"
300,null,"37,8","6,19","2,5","2,476"
290,175,"7,35",9,"2,4","3,75"
279,160,"8,36",13,2,"6,5"
202,24,10,"6,7","2,6","2,576923077"
334,"22,9","8,01",12,"2,4",5
"200,4",null,"25,56",7,"2,4","2,916666667"
198,102,"8,36","7,4","1,8","4,111111111"
"211,6","26,7","10,8","4,2","1,6","2,625"
205,null,null,"9,7","2,07","4,685990338"
326,300,18,14,"2,4","5,833333333"
270,null,null,15,"2,5",6
258,null,null,6,"2,5","2,4"
300,197,"13,5","12,5","2,6","4,807692308"
200,89,"20,9","4,8","1,7","2,823529412"
"201,7",28,null,11,"1,8","6,111111111"
198,9,13,9,2,"4,5"
264,null,"20,3",12,"2,5","4,8"
230,31,"13,3","4,8","1,8","2,666666667"
284,107,"9,92","5,8","1,49","3,89261745"
252,270,null,8,"1,56","5,128205128"
266,null,null,"10,4","1,95","5,333333333"
242,null,null,"14,7",2,"7,35"
259,null,null,"10,01","1,65","6,066666667"
224,null,null,"4,2","1,86","2,258064516"
306,148,10.3,11,1.9,"0,0002488406289"
294,null,5.54,"9,88","1,93","5,119170984"

Upvotes: 0

Views: 939

Answers (1)

mck

Reputation: 42422

You can map over the columns using Spark SQL's regexp_replace. collect is not needed and will not perform well. You might also want to cast to double instead of float, because some entries have many decimal places.

import org.apache.spark.sql.functions.{col, regexp_replace}

// Replace the decimal comma with a dot in every column, then cast to double.
val new_df = df.select(
    df.columns.map(
        c => regexp_replace(col(c), ",", ".").cast("double").as(c)
    ): _*
)
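For completeness, a minimal end-to-end sketch of how this might be wired up on Databricks. The file path and the nullValue option are assumptions for illustration, not from the original post:

import org.apache.spark.sql.functions.{col, regexp_replace}

// Hypothetical path; point this at wherever the CSV actually lives.
val df = spark.read
  .option("header", "true")
  .option("nullValue", "null")   // the sample data spells nulls out literally
  .csv("/FileStore/tables/data.csv")

val new_df = df.select(
  df.columns.map(c => regexp_replace(col(c), ",", ".").cast("double").as(c)): _*
)

new_df.printSchema()   // every column should now be double
new_df.show(5)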

Upvotes: 1
