R1tschY

Reputation: 3212

Extract byte from Spark BinaryType

I have a table with a binary column of type BinaryType:

>>> df.show(3)
+--------+--------------------+
|       t|               bytes|
+--------+--------------------+
|0.145533|[10 50 04 89 00 3...|
|0.345572|[60 94 05 89 80 9...|
|0.545574|[99 50 68 89 00 7...|
+--------+--------------------+
only showing top 3 rows
>>> df.schema
StructType(List(StructField(t,DoubleType,true),StructField(bytes,BinaryType,true)))

If I extract the first byte of the binary, I get an exception from Spark:

>>> df.select(df["t"], df["bytes"].getItem(0)).show(3)
AnalysisException: u"Can't extract value from bytes#477;"

A cast to ArrayType(ByteType) also didn't work:

>>> df.select(df["t"], df["bytes"].cast(ArrayType(ByteType())).getItem(0)).show(3)
AnalysisException: u"cannot resolve '`bytes`' due to data type mismatch: cannot cast BinaryType to ArrayType(ByteType,true) ..."

How can I extract the bytes?

Upvotes: 1

Views: 7784

Answers (2)

Emer

Reputation: 3824

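An alternative is the built-in function substring from pyspark.sql.functions, which can slice a BinaryType column using position and length arguments. The position is 1-based, so position 2 selects the second byte.

Here is a demo following the accepted answer's example (f is pyspark.sql.functions). Casting the result to string decodes the byte 0x32 as the character "2":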

df.select("*", 
          f.substring("bytes", 2, 1), 
          f.substring("bytes", 2, 1).cast("string")).show()

+---+----------+----------------------+--------------------------------------+
|  t|     bytes|substring(bytes, 2, 1)|CAST(substring(bytes, 2, 1) AS STRING)|
+---+----------+----------------------+--------------------------------------+
|  1|[0A 32 04]|                  [32]|                                     2|
|  2|[0A 32 04]|                  [32]|                                     2|
+---+----------+----------------------+--------------------------------------+
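If the numeric value of the byte is wanted rather than a one-byte binary or its character, the slice can be converted through its hex representation. The following is a minimal sketch, not a definitive implementation: the helper name byte_as_int is hypothetical, and it assumes that a 0-based index is more convenient for the caller than substring's 1-based position.

```python
def byte_as_int(column_name, index):
    """Build a Column holding the unsigned byte at a 0-based index as an int.

    byte_as_int is a hypothetical helper name, not part of PySpark.
    """
    # Deferred import so that defining the helper does not require Spark.
    from pyspark.sql import functions as f

    # substring is 1-based, so shift the 0-based index by one.
    one_byte = f.substring(column_name, index + 1, 1)
    # hex() renders the single byte as two hex digits; conv() converts that
    # string from base 16 to base 10; the final cast yields an integer column.
    return f.conv(f.hex(one_byte), 16, 10).cast("int")
```

With the DataFrame from the demo, it could be used as df.select("t", byte_as_int("bytes", 1).alias("second")). Everything stays in native column expressions, so no Python UDF is involved.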

Upvotes: 2

Daniel de Paula

Reputation: 17872

You can make a simple udf for that:

from pyspark.sql import functions as f

a = bytearray([10, 50, 4])
df = sqlContext.createDataFrame([(1, a), (2, a)], ("t", "bytes"))
df.show()
+---+----------+
|  t|     bytes|
+---+----------+
|  1|[0A 32 04]|
|  2|[0A 32 04]|
+---+----------+
u = f.udf(lambda a: a[0])
df.select(u(df['bytes']).alias("first")).show()
+-----+
|first|
+-----+
|   10|
|   10|
+-----+

Edit

If you want the position of the extraction to be a parameter, you could do some currying:

func = lambda i: lambda a: a[i]
my_udf = lambda i: f.udf(func(i))

df.select(my_udf(2)(df['bytes']).alias("last")).show()

+----+
|last|
+----+
|   4|
|   4|
+----+
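One caveat with the UDFs above: f.udf defaults to a string return type, and the lambda raises an error on null rows or when the index is out of range. A hedged variant of the same currying idea, assuming an integer result and NULL for missing bytes are what is wanted (byte_at and byte_at_udf are hypothetical names):

```python
def byte_at(i):
    """Return a function that extracts the unsigned byte at index i, or None."""
    def extract(data):
        # Null rows and out-of-range indexes yield None (SQL NULL).
        if data is None or not -len(data) <= i < len(data):
            return None
        # bytearray(...) makes indexing return an int on Python 2 and 3 alike.
        return bytearray(data)[i]
    return extract


def byte_at_udf(i):
    """Wrap byte_at(i) as a Spark UDF with an explicit integer return type."""
    # Deferred import so byte_at can be unit-tested without Spark installed.
    from pyspark.sql import functions as f
    from pyspark.sql.types import IntegerType
    return f.udf(byte_at(i), IntegerType())
```

Usage mirrors the curried version: df.select(byte_at_udf(2)(df['bytes']).alias("last")).show(). Keeping the plain Python function separate from the UDF wrapper makes the extraction logic easy to test on its own.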

Upvotes: 5
