Create new column in Pyspark Dataframe by filling existing Column

Question

I am trying to create a new column in an existing PySpark DataFrame. Currently the DataFrame looks as follows:

+----+----+---+----+----+----+----+
|Acct| M1D|M1C| M2D| M2C| M3D| M3C|
+----+----+---+----+----+----+----+
|   B|  10|200|null|null|  20|null|
|   C|1000|100|  10|null|null|null|
|   A| 100|200| 200| 200| 300|  10|
+----+----+---+----+----+----+----+

I want to create a new column, Ratio, that copies M2C with its null values replaced by 0. My expected output is as follows:

+----+----+---+----+----+----+----+-----+
|Acct| M1D|M1C| M2D| M2C| M3D| M3C|Ratio|
+----+----+---+----+----+----+----+-----+
|   B|  10|200|null|null|  20|null|    0|
|   C|1000|100|  10|null|null|null|    0|
|   A| 100|200| 200| 200| 300|  10|  200|
+----+----+---+----+----+----+----+-----+

I tried to get this result with the following line of code:

df = df.withColumn('Ratio', df.select('M2C').na.fill(0))

The above line of code resulted in an assertion error as shown below.

AssertionError: col should be Column

A possible solution that I found using this link was to use the lit function. I changed my code to:

df = df.withColumn('Ratio', lit(df.select('M2C').na.fill(0)))

The above code led to AttributeError: 'DataFrame' object has no attribute '_get_object_id'

How can I achieve my desired output?

Bitswazsky · Accepted Answer

You're doing two things wrong here.

df.select returns a DataFrame, not a Column, and withColumn expects a Column as its second argument.
na.fill without a subset argument replaces null values in every column of a compatible type, not just in specific columns.

The following code snippet will solve your use case:

from pyspark.sql.functions import col
df = df.withColumn('Ratio', col('M2C')).fillna(0, subset=['Ratio'])
