Feng Chen

Reputation: 2253

How to change multiple columns' types in pyspark?

I am just starting to learn PySpark. I want to change column types like this:

df1=df.select(df.Date.cast('double'),df.Time.cast('double'),
          df.NetValue.cast('double'),df.Units.cast('double'))

Here df is a DataFrame; I select 4 columns and cast all of them to double. Because I use select, all the other columns are dropped.

But what if df has hundreds of columns and I only need to change those 4, while keeping all the others? How can I do that?

Upvotes: 0

Views: 11766

Answers (5)

Robert Robison

Reputation: 389

This is a lot simpler since the withColumns method was added to the PySpark API (in version 3.3.0):

from pyspark.sql.functions import col


convert_cols = [
    "Date",
    "Time",
    "NetValue",
    "Units",
]

df = df.withColumns({c: col(c).cast('double') for c in convert_cols})

Upvotes: 0

Abbaan Nassar

Reputation: 1

I understand that you would like an answer without a for loop that keeps all the original columns while updating only a subset. The following should be what you are looking for:

from pyspark.sql.functions import col

df = df.select(*(col(c).cast("double").alias(c) for c in subset),*[x for x in df.columns if x not in subset])

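where subset is a list of the column names you would like to update. Note that the cast columns are moved to the front of the result, followed by the untouched columns.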
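If the original column order matters, the same idea can be written as a single pass over df.columns. This is only a minimal sketch: the helper name cast_subset and its dtype parameter are hypothetical, not part of PySpark, and it relies only on df.columns, df[...], Column.cast, Column.alias and DataFrame.select.

```python
def cast_subset(df, subset, dtype="double"):
    """Return df with the columns in `subset` cast to `dtype`.

    Hypothetical helper (not a PySpark API): every column is kept,
    in its original position; only the listed ones are cast.
    """
    targets = set(subset)
    return df.select(
        *[
            # alias(c) keeps the original column name after the cast
            df[c].cast(dtype).alias(c) if c in targets else df[c]
            for c in df.columns
        ]
    )
```

With the question's data this would be called as df1 = cast_subset(df, ["Date", "Time", "NetValue", "Units"]).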

Upvotes: 0

Andres Mitre

Reputation: 701

Another way using selectExpr():

df1 = df.selectExpr("cast(Date as double) Date",
    "cast(NetValue as string) NetValue")
df1.printSchema()
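Note that selectExpr, like select, returns only the expressions it is given, so the call above drops every other column. One way around that, sketched here under assumptions, is to generate one expression per column. The helper build_cast_exprs and the type_map argument are hypothetical names, not part of PySpark; column names are wrapped in backticks, so names that themselves contain a backtick are not handled.

```python
def build_cast_exprs(columns, type_map):
    """Build one Spark SQL expression per column for use with selectExpr.

    Hypothetical helper: columns listed in `type_map` (name -> SQL type)
    are cast, all others are passed through unchanged, in original order.
    """
    exprs = []
    for name in columns:
        if name in type_map:
            exprs.append(f"cast(`{name}` as {type_map[name]}) as `{name}`")
        else:
            exprs.append(f"`{name}`")
    return exprs

# Intended usage with a DataFrame df:
# df1 = df.selectExpr(*build_cast_exprs(df.columns, {"Date": "double", "NetValue": "string"}))
```

Because the helper only builds strings, the generated expressions can be printed and inspected before they are passed to selectExpr.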

Using withColumn():

from pyspark.sql.types import DoubleType, StringType

df1 = df.withColumn("Date", df["Date"].cast(DoubleType())) \
      .withColumn("NetValue", df["NetValue"].cast(StringType()))
df1.printSchema()

See the pyspark.sql.types documentation for the available types.

Upvotes: 0

ags29

Reputation: 2696

Try this, which casts every column in df to double:

from pyspark.sql.functions import col

df = df.select([col(column).cast('double') for column in df.columns])

Upvotes: 7

Ranga Vure

Reputation: 1932

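convert_cols = ["Date", "Time", "NetValue", "Units"]

for c in df.columns:
    # cast only the columns that need converting; the rest are left unchanged
    if c in convert_cols:
        df = df.withColumn(c, df[c].cast('double'))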

Upvotes: 4
