Abhra Ray

Reputation: 139

Fill null values in pyspark dataframe based on data type of column

Suppose I have a sample dataframe like the one below:

+-----+----+----+
| col1|col2|col3|
+-----+----+----+
|  cat|  10| 1.5|
|  dog|  20| 9.0|
| null|  30|null|
|mouse|null|15.3|
+-----+----+----+

I want to fill the nulls based on each column's data type. For example, string columns should be filled with 'N/A', integer columns with 0, and float columns with 0.0.

I tried using df.fillna() with the column names hard-coded (see the line below), but since there could be any number of columns, I would like a dynamic solution.
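For context, a hard-coded call would look something like this (the column names are just those of the sample dataframe, which is exactly what stops scaling once the schema isn't known in advance):

df.fillna({'col1': 'N/A', 'col2': 0, 'col3': 0.0})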

Upvotes: 2

Views: 2601

Answers (1)

Cena

Reputation: 3419

df.dtypes gives you a list of (column_name, data_type) tuples. It can be used to get the lists of string / int / float column names in df. Subset these columns and fillna() accordingly.
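For the sample dataframe (PySpark infers Python ints as bigint and Python floats as double), df.dtypes looks like this:

df.dtypes
# [('col1', 'string'), ('col2', 'bigint'), ('col3', 'double')]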

df = sc.parallelize([['cat', 10, 1.5], ['dog', 20, 9.0],
                     [None, 30, None], ['mouse', None, 15.3]]) \
       .toDF(['col1', 'col2', 'col3'])

# collect column names by their inferred data type
string_col = [item[0] for item in df.dtypes if item[1].startswith('string')]
big_int_col = [item[0] for item in df.dtypes if item[1].startswith('bigint')]
double_col = [item[0] for item in df.dtypes if item[1].startswith('double')]

df.fillna('N/A', subset=string_col)\
    .fillna(0, subset=big_int_col)\
    .fillna(0.0, subset=double_col)\
    .show()

Output:

+-----+----+----+
| col1|col2|col3|
+-----+----+----+
|  cat|  10| 1.5|
|  dog|  20| 9.0|
|  N/A|  30| 0.0|
|mouse|   0|15.3|
+-----+----+----+
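As a possible variation (not part of the answer above, just a sketch): fillna() also accepts a dict mapping column names to fill values, so you can build that dict from df.dtypes in one pass and make a single call. The fill_values mapping below is an assumed set of defaults; adjust it to taste.

# Sketch: one fillna() call driven by a dtype-to-default mapping.
fill_values = {'string': 'N/A', 'bigint': 0, 'double': 0.0}  # assumed defaults
fill_map = {name: fill_values[dtype]
            for name, dtype in df.dtypes
            if dtype in fill_values}
df.fillna(fill_map).show()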

Upvotes: 6
