Reputation: 537
I am trying to read a CSV file into a PySpark dataframe, and I want to convert a particular string field to an integer based on a simple logic.
When reading with pandas this can be done simply with converters.
In pandas it can be done like this: if status is 'acquired' it will be converted to 1, else 0.
dataset=pd.read_csv("../input/startup-success-prediction/startup data.csv",\
converters={'status': lambda x: int(x == 'acquired')})
How can I do this while reading into a PySpark dataframe?
df=spark.read.csv("../input/startup-success-prediction/startup data.csv")
I want to add the same type of converter in PySpark.
Upvotes: 0
Views: 248
Reputation: 2742
While the Spark API does not offer that capability in the spark.read.csv function (current options for CSV read/write can be found here), the column conversion can be performed after you read in the data, using the when function with otherwise:
from pyspark.sql.functions import when
df = spark.read.csv("../input/startup-success-prediction/startup data.csv", header=True)
df = df.withColumn('status', when(df['status'] == 'acquired', 1).otherwise(0))
Note that header=True is needed so the column is named status rather than the default _c0, _c1, etc., and using 1/0 instead of True/False matches the integer output of the pandas converter.
Documentation for the when function can be found here.
Upvotes: 1