Reputation: 351
I'm trying to read a CSV file that has the following data:
name,date,win,stops,cost
a,2020-1-1,true,"[""x"", ""y"", ""z""]", 2.3
b,2021-3-1,true,, 1.3
c,2023-2-1,true,"[""x""]", 0.3
d,2021-3-1,true,"[""z""]", 2.3
Using inferSchema makes the stops field spill over into the following columns, which messes up the DataFrame.
If I give my own schema like:
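from pyspark.sql.types import (StructType, StructField, StringType,
    TimestampType, BooleanType, ArrayType, DoubleType)
schema = StructType([
StructField('name', StringType()),
StructField('date', TimestampType()),
StructField('win', BooleanType()),
StructField('stops', ArrayType(StringType())),
StructField('cost', DoubleType())])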
it results in this exception:
pyspark.sql.utils.AnalysisException: CSV data source does not support array<string> data type.
So how can I read this CSV properly without hitting this failure?
Upvotes: 5
Views: 5377
Reputation: 9308
Since the CSV data source doesn't support array types, you need to read the stops column as a string first and then convert it to an array.
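from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType
# Set the escape option to ", since the file escapes quotes by doubling them
# and " is not the default escape character (\).
df = spark.read.csv('file.csv', header=True, escape='"')
# Parse the JSON text in stops into an array of strings.
df = df.withColumn('stops', F.from_json('stops', ArrayType(StringType())))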
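To see what the two steps are doing without starting Spark, here is a minimal sketch using only Python's standard library. It is an analogy, not Spark's implementation: the `csv` module stands in for the CSV reader (it treats a doubled `""` inside a quoted field as one literal quote by default), and `json.loads` stands in for `from_json`. The sample rows are copied from the question; mapping an empty field to `None` is an assumption made for this illustration.

```python
import csv
import io
import json

# Two sample rows copied from the question, header included.
raw = '''name,date,win,stops,cost
a,2020-1-1,true,"[""x"", ""y"", ""z""]", 2.3
b,2021-3-1,true,, 1.3
'''

# Step 1: parse the CSV. Every field, including stops, comes back as a string.
# Inside a quoted field, "" is read as a single literal quote character.
rows = list(csv.DictReader(io.StringIO(raw)))

# Step 2: decode the JSON text in stops into a list.
# An empty field is mapped to None here (an assumption for this sketch).
for row in rows:
    row['stops'] = json.loads(row['stops']) if row['stops'] else None

print(rows[0]['stops'])  # ['x', 'y', 'z']
print(rows[1]['stops'])  # None
```

If the quote escaping is not handled in step 1, the commas inside the JSON text are liable to be treated as field separators, which matches the "spilling over" described in the question.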
Upvotes: 7
Reputation: 1033
I guess this is what you are looking for:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('abc').getOrCreate()
dataframe = spark.read.options(header='True', delimiter=",").csv("file_name.csv")
dataframe.printSchema()
Let me know if it helps
Upvotes: -2