ranjith reddy

Reputation: 481

Spark decode and decompress gzip an embedded base 64 string

My Spark program reads a file that contains a gzip-compressed string that is base64-encoded. I have to decode and decompress it. I used Spark's unbase64 to decode it, which produces a byte array:

bytedf=df.withColumn("unbase",unbase64(col("value")) )

Is there any method available in Spark that decompresses the resulting byte array?
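For reference, an input string of the kind described can be reproduced with the standard library alone (a sketch; the actual file contents and column values are not shown in the question):

```python
import base64
import gzip

original = "<record>example payload</record>"

# gzip-compress, then base64-encode -- the format the file contains
encoded = base64.b64encode(gzip.compress(original.encode("utf-8"))).decode("ascii")

# unbase64() in Spark reverses only the outer base64 layer,
# leaving the still-compressed gzip bytes that need a second step
compressed_bytes = base64.b64decode(encoded)
roundtrip = gzip.decompress(compressed_bytes).decode("utf-8")
```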

Upvotes: 6

Views: 3802

Answers (3)

Fadhiil Muhammad

Reputation: 41

I had a similar case; here is what I did:

from pyspark.sql.functions import col,unbase64,udf
from gzip import decompress

bytedf=df1.withColumn("unbase",unbase64(col("payload")))
decompress_func = lambda x: decompress(x).decode('utf-8')
udf_decompress = udf(decompress_func)
df2 = bytedf.withColumn('unbase_decompress', udf_decompress('unbase'))
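The per-row logic of this UDF can be checked without a Spark session, since `unbase64` is equivalent to `base64.b64decode` on each value (a minimal sketch with a made-up payload):

```python
import base64
import gzip

# Simulate one cell of the "payload" column: gzip then base64
payload = base64.b64encode(gzip.compress("hello spark".encode("utf-8")))

# Spark's unbase64(col("payload")) does this step per row
unbase = base64.b64decode(payload)

# The same lambda wrapped by udf() above
decompress_func = lambda x: gzip.decompress(x).decode("utf-8")
result = decompress_func(unbase)
```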

Upvotes: 4

ranjith reddy

Reputation: 481

I wrote a UDF:

import base64
import zlib
from pyspark.sql.functions import udf

def decompress(ip):
    bytecode = base64.b64decode(ip)
    # 32 + MAX_WBITS makes zlib auto-detect the gzip header
    d = zlib.decompressobj(32 + zlib.MAX_WBITS)
    decompressed_data = d.decompress(bytecode)
    return decompressed_data.decode('utf-8')



decompress = udf(decompress)
decompressedDF = df.withColumn("decompressed_XML",decompress("value"))
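The `32 + zlib.MAX_WBITS` trick is what lets this UDF handle gzip-framed data, which plain `zlib.decompress` rejects. It can be verified outside Spark with a sample value (hypothetical payload for illustration):

```python
import base64
import gzip
import zlib

# A base64 string wrapping gzip-compressed XML, as in the question
payload = base64.b64encode(gzip.compress(b"<note>hello</note>"))

bytecode = base64.b64decode(payload)
# wbits = 32 + MAX_WBITS tells zlib to detect gzip or zlib framing itself
d = zlib.decompressobj(32 + zlib.MAX_WBITS)
decompressed_xml = d.decompress(bytecode).decode("utf-8")
```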

Upvotes: 4

Rahul Sharma

Reputation: 5834

A Spark example using Python's base64 module:

import base64
.
.
# decode the base64 string using a map operation, or create a udf
df.map(lambda base64string: base64.b64decode(base64string), <string encoder>)

Read here for a detailed Python example.

Upvotes: 1
