Reputation: 845
I have an encoded dataframe and I managed to get it decoded using following code in PySpark. Is there any simple way where I can have an additional column in the dataframe itself through Scala/PySpark?
import base64
import numpy as np
df = spark.read.parquet("file_path")
encodedColumn = base64.decodestring(df.take(1)[0].column2)
t1 = np.frombuffer(encodedColumn ,dtype='<f4')
I looked up multiple similar questions, but couldnt get them to work.
Edit: Got it working with help from a colleague.
def binaryToFloatArray(stringValue: String): Array[Float] = {
val t:Array[Byte] = Base64.getDecoder().decode(stringValue)
val b = ByteBuffer.wrap(t).order(ByteOrder.LITTLE_ENDIAN).asFloatBuffer()
val copy = new Array[Float](2048)
b.get(copy)
return copy
}
val binaryToFloatArrayUDF = udf(binaryToFloatArray _)
val finalResultDf = dftest.withColumn("myFloatArray", binaryToFloatArrayUDF(col("_2"))).drop("_2")
Upvotes: 2
Views: 7364
Reputation: 1944
You have base64 and unbase64 functions for this.
You could
from pyspark.sql.functions import unbase64,base64
got = spark.createDataFrame([(1, "Jon"), (2, "Danny"), (3, "Tyrion")], ("id", "name"))
+---+------+
| id| name|
+---+------+
| 1| Jon|
| 2| Danny|
| 3|Tyrion|
+---+------+
encoded_got = got.withColumn('encoded_base64_name', base64(got.name))
+---+------+-------------------+
| id| name|encoded_base64_name|
+---+------+-------------------+
| 1| Jon| Sm9u|
| 2| Danny| RGFubnk=|
| 3|Tyrion| VHlyaW9u|
+---+------+-------------------+
decoded_got = encoded_got.withColumn('decoded_base64', unbase64(encoded_got.encoded_base64).cast("string"))
# Need to use cast("string") to convert from binary to string
+---+------+--------------+--------------+
| id| name|encoded_base64|decoded_base64|
+---+------+--------------+--------------+
| 1| Jon| Sm9u| Jon|
| 2| Danny| RGFubnk=| Danny|
| 3|Tyrion| VHlyaW9u| Tyrion|
+---+------+--------------+--------------+
Upvotes: 9