yi wang
yi wang

Reputation: 13

How to use fillna function with column having period in pyspark

I tried to run fillna to insert nan in column with special character "."

df = spark.createDataFrame(
    [(None, None), ('U1', None), ('U3', 1.0)], 
    ['USER_ID', 'a.b']
)

I tried

df = df.fillna({"`a.b`": float("nan")})

also

df = df.fillna({"a.b": float("nan")})

Both of them doesn't work, who have experience on this?

Upvotes: 1

Views: 2185

Answers (2)

pault
pault

Reputation: 43524

It seems that there is a limitation of pyspark.sql.DataFrame.fillna() which doesn't allow you to specify column names with periods in them when you use the value parameter as a dictionary.

From the docs:

value – int, long, float, string, bool or dict. Value to replace null values with. If the value is a dict, then subset is ignored and value must be a mapping from column name (string) to replacement value. The replacement value must be an int, long, float, boolean, or string.

You should be able to use fillna using the other syntax that specifies both the value and subset parameters.

df.fillna(value=float("nan"), subset=["a.b"]).show()
#+-------+---+
#|USER_ID|a.b|
#+-------+---+
#|   null|NaN|
#|     U1|NaN|
#|     U3|1.0|
#+-------+---+

The above worked for me in Spark 2.4, but I don't see why it should not work on older version.

If you are still having trouble, another way to do this would be to temporarily rename your columns, call fillna, and then rename the columns back to the original values:

Here I will rename the columns to replace the "." with the string "_DOT_", which I deliberately picked to avoid conflicting with existing substrings in other column names.

df.toDF(*[c.replace(".", "_DOT_") for c in df.columns])\
    .fillna({"a_DOT_b": float("nan")})\
    .toDF(*df.columns)\
    .show()
#+-------+---+
#|USER_ID|a.b|
#+-------+---+
#|   null|NaN|
#|     U1|NaN|
#|     U3|1.0|
#+-------+---+

Upvotes: 1

Lamanus
Lamanus

Reputation: 13581

This is working.

df = spark.createDataFrame([(None, None), ('U1', None), ('U3', 1.0)], ['USER_ID', 'a.b'])
df = df.fillna(float("nan"), ['`a.b`'])
df.show(10, False)

+-------+---+
|USER_ID|a.b|
+-------+---+
|null   |NaN|
|U1     |NaN|
|U3     |1.0|
+-------+---+

Upvotes: 0

Related Questions