Reputation: 8395
I have a dataframe with column as String. I wanted to change the column type to Double type in PySpark.
Following is the way, I did:
toDoublefunc = UserDefinedFunction(lambda x: x,DoubleType())
changedTypedf = joindf.withColumn("label",toDoublefunc(joindf['show']))
Just wanted to know, is this the right way to do it as while running through Logistic Regression, I am getting some error, so I wonder, is this the reason for the trouble.
Upvotes: 164
Views: 457364
Reputation: 765
use:
df1.select(col('show').cast("Float").alias('label')).show()
or
df1.selectExpr("cast(show AS FLOAT) as label").show()
Upvotes: 1
Reputation: 1376
One issue with other answers (depending on your version of Pyspark) is usage of withColumn
. Performance issues have been observed at least in v2.4.4 (see this thread). The spark docs mention this about withColumn
:
this method introduces a projection internally. Therefore, calling it multiple times, for instance, via loops in order to add multiple columns can generate big plans which can cause performance issues and even StackOverflowException. To avoid this, use select with the multiple columns at once.
One way to achieve the recommended usage of select
instead in general would be:
from pyspark.sql.types import *
from pyspark.sql import functions as F
cols_to_fix = ['show']
other_cols = [col for col in joindf.columns if not col in cols_to_fix]
joindf = joindf.select(
*other_cols,
F.col('show').cast(DoubleType())
)
Upvotes: 0
Reputation: 933
Preserve the name of the column and avoid extra column addition by using the same name as input column:
from pyspark.sql.types import DoubleType
changedTypedf = joindf.withColumn("show", joindf["show"].cast(DoubleType()))
Upvotes: 81
Reputation: 631
Given answers are enough to deal with the problem but I want to share another way which may be introduced the new version of Spark (I am not sure about it) so given answer didn't catch it.
We can reach the column in spark statement with col("colum_name")
keyword:
from pyspark.sql.functions import col
changedTypedf = joindf.withColumn("show", col("show").cast("double"))
Upvotes: 16
Reputation: 578
PySpark version:
df = <source data>
df.printSchema()
from pyspark.sql.types import *
# Change column type
df_new = df.withColumn("myColumn", df["myColumn"].cast(IntegerType()))
df_new.printSchema()
df_new.select("myColumn").show()
Upvotes: 7
Reputation: 330113
There is no need for an UDF here. Column
already provides cast
method with DataType
instance :
from pyspark.sql.types import DoubleType
changedTypedf = joindf.withColumn("label", joindf["show"].cast(DoubleType()))
or short string:
changedTypedf = joindf.withColumn("label", joindf["show"].cast("double"))
where canonical string names (other variations can be supported as well) correspond to simpleString
value. So for atomic types:
from pyspark.sql import types
for t in ['BinaryType', 'BooleanType', 'ByteType', 'DateType',
'DecimalType', 'DoubleType', 'FloatType', 'IntegerType',
'LongType', 'ShortType', 'StringType', 'TimestampType']:
print(f"{t}: {getattr(types, t)().simpleString()}")
BinaryType: binary
BooleanType: boolean
ByteType: tinyint
DateType: date
DecimalType: decimal(10,0)
DoubleType: double
FloatType: float
IntegerType: int
LongType: bigint
ShortType: smallint
StringType: string
TimestampType: timestamp
and for example complex types
types.ArrayType(types.IntegerType()).simpleString()
'array<int>'
types.MapType(types.StringType(), types.IntegerType()).simpleString()
'map<string,int>'
Upvotes: 282
Reputation: 8395
the solution was simple -
toDoublefunc = UserDefinedFunction(lambda x: float(x),DoubleType())
changedTypedf = joindf.withColumn("label",toDoublefunc(joindf['show']))
Upvotes: 1