yuser099881232

Reputation: 351

Read csv that contains array of string in pyspark

I'm trying to read a csv that has the following data:

name,date,win,stops,cost
a,2020-1-1,true,"[""x"", ""y"", ""z""]", 2.3
b,2021-3-1,true,, 1.3
c,2023-2-1,true,"[""x""]", 0.3
d,2021-3-1,true,"[""z""]", 2.3

Using inferSchema results in the stops field spilling over into the following columns and breaking the dataframe.

If I give my own schema like:

    from pyspark.sql.types import (StructType, StructField, StringType,
                                   TimestampType, BooleanType, ArrayType, DoubleType)

    schema = StructType([
        StructField('name', StringType()),
        StructField('date', TimestampType()),
        StructField('win', BooleanType()),
        StructField('stops', ArrayType(StringType())),
        StructField('cost', DoubleType())])

I get this exception:

pyspark.sql.utils.AnalysisException: CSV data source does not support array<string> data type.

so how would I properly read the csv without this failure?

Upvotes: 5

Views: 5377

Answers (2)

Emma

Reputation: 9308

Since the CSV data source doesn't support the array type, you need to read the column as a string first, then convert it.

import pyspark.sql.functions as F
from pyspark.sql.types import ArrayType, StringType

# You need to set the escape option to ", since it is not the default escape character (\).
df = spark.read.csv('file.csv', header=True, escape='"')

df = df.withColumn('stops', F.from_json('stops', ArrayType(StringType())))

Upvotes: 7

sargupta

Reputation: 1033

I guess this is what you are looking for:

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('abc').getOrCreate()


dataframe = spark.read.options(header='True', delimiter=",").csv("file_name.csv")

dataframe.printSchema()

Let me know if it helps.

Upvotes: -2
