Reputation: 2177
I have created a PySpark DataFrame as below:
df = spark.createDataFrame([([0.1,0.2], 2), ([0.1], 3), ([0.3,0.3,0.4], 2)], ("a", "b"))
df.show()
+---------------+---+
| a| b|
+---------------+---+
| [0.1, 0.2]| 2|
| [0.1]| 3|
|[0.3, 0.3, 0.4]| 2|
+---------------+---+
Now, I am trying to parse column 'a' one row at a time, as below:
from pyspark.sql.functions import udf, struct
from pyspark.sql.types import ArrayType, FloatType

parse_col = udf(lambda row: [x for x in row.a], ArrayType(FloatType()))
new_df = df.withColumn("a_new", parse_col(struct([df[x] for x in df.columns if x == 'a'])))
new_df.show()
This works fine.
+---------------+---+---------------+
| a| b| a_new|
+---------------+---+---------------+
| [0.1, 0.2]| 2| [0.1, 0.2]|
| [0.1]| 3| [0.1]|
|[0.3, 0.3, 0.4]| 2|[0.3, 0.3, 0.4]|
+---------------+---+---------------+
But when I try to format the values, as below:
count_empty_columns = udf(lambda row: ["{:.2f}".format(x) for x in row.a], ArrayType(FloatType()))
new_df = df.withColumn("a_new", count_empty_columns(struct([df[x] for x in df.columns if x == 'a'])))
new_df.show()
it's not working; the values are missing:
+---------------+---+-----+
| a| b|a_new|
+---------------+---+-----+
| [0.1, 0.2]| 2| [,]|
| [0.1]| 3| []|
|[0.3, 0.3, 0.4]| 2| [,,]|
+---------------+---+-----+
I am using Spark v2.3.1.
Any idea what I am doing wrong here?
Thanks
Upvotes: 0
Views: 1130
Reputation:
It is simple: types matter. You declare the output as array<float> (ArrayType(FloatType())), while a formatted string is not a float. Hence the result is undefined, and Spark fills in null for each element (shown as the empty slots in show()). In other words, being a string and being a float are mutually exclusive.
If you wanted strings, you should declare the column as such:
udf(lambda row: ["{:.2f}".format(x) for x in row.a], "array<string>")
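For example, here is a minimal end-to-end sketch of the string-typed variant; it applies the UDF directly to column a rather than going through struct, which is a simplification on my part:
from pyspark.sql.functions import udf

# Declared return type array<string> now matches the formatted values
format_col = udf(lambda xs: ["{:.2f}".format(x) for x in xs], "array<string>")
df.withColumn("a_new", format_col(df["a"])).show()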
otherwise you should consider rounding or using fixed-precision numbers:
df.select(df["a"].cast("array<decimal(38, 2)>")).show()
+------------------+
| a|
+------------------+
| [0.10, 0.20]|
| [0.10]|
|[0.30, 0.30, 0.40]|
+------------------+
though these are completely different operations.
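If you need to stay on the float side, here is a hedged sketch of the rounding route (Spark 2.3 has no built-in function to round array elements, so this falls back on a UDF; the name round_udf is mine):
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, FloatType

# Round each element to 2 decimal places while keeping array<float>
round_udf = udf(lambda xs: [round(float(x), 2) for x in xs], ArrayType(FloatType()))
df.withColumn("a_new", round_udf(df["a"])).show()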
Upvotes: 1