Reputation: 111
I have a CSV data file like this (with ^ as the delimiter):
ID | name | age |
---|---|---|
0 | | |
1 | Mike | 20 |
When I do
df = spark.read.option("delimiter", "^").option("quote", "") \
    .option("header", "true").option("inferSchema", "true").csv(xxxxxxx)
Spark reads the two empty columns in the first row as null.
df.show():
ID | name | age |
---|---|---|
0 | null | null |
1 | Mike | 20 |
How can I stop PySpark from reading the empty fields as null and keep them as empty strings instead?
I have tried adding some options at the end:
1. .option("nullValue", "xxxx").option("treatEmptyValuesAsNulls", False)
2. .option("nullValue", None).option("treatEmptyValuesAsNulls", False)
3. .option("nullValue", None).option("emptyValue", None)
4. .option("nullValue", "xxx").option("emptyValue", "xxx")
But no matter what I do, PySpark still reads the data as null. Is there a way to make PySpark read the empty strings as they are?
Thanks
Upvotes: 0
Views: 1617
Reputation: 4069
It looks like empty values have been treated as null since Spark version 2.0.1. One way to achieve your result is with df.na.fill(...):
df = spark.read.csv('your_data_path', sep='^', header=True)
df.printSchema()
# root
# |-- ID: string (nullable = true)
# |-- name: string (nullable = true)
# |-- age: string (nullable = true)
# Fill all columns
# df = df.na.fill('')
# Fill specific columns
df = df.na.fill('', subset=['name', 'age'])
df.show(truncate=False)
Output
+---+----+---+
|ID |name|age|
+---+----+---+
|0 | | |
|1 |Mike|20 |
+---+----+---+
Upvotes: 2