slaveCoder

Reputation: 537

Using Converters while reading a pyspark dataframe

I am trying to read a CSV file into a pyspark dataframe. I want to convert a particular string field to an integer based on simple logic.

When reading with pandas, this can be done simply with converters.

In pandas it looks like this: if status is 'acquired' it is converted to 1, else 0:

  # convert 'acquired' to 1 and anything else to 0 while reading
  dataset = pd.read_csv("../input/startup-success-prediction/startup data.csv",
                        converters={'status': lambda x: int(x == 'acquired')})

How can I do the same thing while reading a pyspark dataframe?

   df=spark.read.csv("../input/startup-success-prediction/startup data.csv")

I want to add the same type of converter in pyspark.

Upvotes: 0

Views: 248

Answers (1)

danielcahall

Reputation: 2742

While the Spark API does not offer that capability in the spark.read.csv function (the current options for CSV read/write are listed in the DataFrameReader documentation), the column conversion can be performed after you read in the data, using the when function with otherwise:

   from pyspark.sql.functions import when

   # header=True names the columns from the CSV header row; without it
   # Spark assigns default names (_c0, _c1, ...) and df['status'] would fail
   df = spark.read.csv("../input/startup-success-prediction/startup data.csv", header=True)
   df = df.withColumn('status', when(df['status'] == 'acquired', True).otherwise(False))
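
If you want to match the pandas converter exactly (integer 1/0 rather than a boolean), a small variation of the same approach works; this is a sketch assuming the same file and column name:

   from pyspark.sql.functions import col

   # cast the boolean comparison directly to an integer: 1 if 'acquired', else 0
   df = df.withColumn('status', (col('status') == 'acquired').cast('int'))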

Documentation for the when function is available in the pyspark.sql.functions API reference.
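
To verify the conversion (assuming the file has a header row containing a status column), you can inspect a few rows of the converted column:

   # show the first few values of the converted column
   df.select('status').show(5)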

Upvotes: 1
