Reputation: 1318
I am trying to create a new column in an existing PySpark DataFrame. Currently the DataFrame looks as follows:
+----+----+---+----+----+----+----+
|Acct| M1D|M1C| M2D| M2C| M3D| M3C|
+----+----+---+----+----+----+----+
| B| 10|200|null|null| 20|null|
| C|1000|100| 10|null|null|null|
| A| 100|200| 200| 200| 300| 10|
+----+----+---+----+----+----+----+
I want to fill null values in column M2C with 0 and create a new column Ratio. My expected output would be as follows:
+------+------+-----+------+------+------+------+-------+
| Acct | M1D | M1C | M2D | M2C | M3D | M3C | Ratio |
+------+------+-----+------+------+------+------+-------+
| B | 10 | 200 | null | null | 20 | null | 0 |
| C | 1000 | 100 | 10 | null | null | null | 0 |
| A | 100 | 200 | 200 | 200 | 300 | 10 | 200 |
+------+------+-----+------+------+------+------+-------+
I tried to get this result with the following line of code.
df = df.withColumn('Ratio', df.select('M2C').na.fill(0))
The above line of code resulted in an assertion error, as shown below.
AssertionError: col should be Column
A possible solution that I found using this link was to use the lit function.
I changed my code to
df = df.withColumn('Ratio', lit(df.select('M2C').na.fill(0)))
The above code led to AttributeError: 'DataFrame' object has no attribute '_get_object_id'
How can I achieve my desired output?
Upvotes: 0
Views: 815
Reputation: 4698
You're doing two things wrong here.
df.select returns a DataFrame, not a Column, and withColumn expects a Column.
na.fill called without a subset replaces null values in every matching column, not just in a specific one.
The following code snippet will solve your use case:
from pyspark.sql.functions import col
df = df.withColumn('Ratio', col('M2C')).fillna(0, subset=['Ratio'])
Upvotes: 2