Reputation: 139
Suppose, I am having a sample dataframe as below:
+-----+----+----+
| col1|col2|col3|
+-----+----+----+
| cat| 10| 1.5|
| dog| 20| 9.0|
| null| 30|null|
|mouse|null|15.3|
+-----+----+----+
I want to fill up the nulls based on the data type. For example for string types I want to fill with 'N/A' and for integer types I want to add 0. Similarly for float I want to add 0.0.
I tried using df.fillna() but then I realized there could be 'N' number of columns so I would like to have a dynamic solution.
Upvotes: 2
Views: 2601
Reputation: 3419
df.dtypes
gives you a tuple of (column_name, data_type)
. It can be used to get the list of string / int / float column names in df
. Subset these columns and fillna()
accordingly.
df = sc.parallelize([['cat', 10, 1.5], ['dog', 20, 9.0],\
[None, 30, None], ['mouse', None, 15.3]])\
.toDF(['col1', 'col2', 'col3'])
string_col = [item[0] for item in df.dtypes if item[1].startswith('string')]
big_int_col = [item[0] for item in df.dtypes if item[1].startswith('bigint')]
double_col = [item[0] for item in df.dtypes if item[1].startswith('double')]
df.fillna('N/A', subset = string_col)\
.fillna(0, subset = big_int_col)\
.fillna(0.0, subset = double_col)\
.show()
Output:
+-----+----+----+
| col1|col2|col3|
+-----+----+----+
| cat| 10| 1.5|
| dog| 20| 9.0|
| N/A| 30| 0.0|
|mouse| 0|15.3|
+-----+----+----+
Upvotes: 6