Reputation: 1
I am really new to Spark and Pandas. I would like to apply a pandas UDF to a large matrix of numpy.ndarray that doesn't have any column names. How should I define the input for the UDF function?
This is what I did. row is a row from the Cassandra database and 'b2' is the column name for an image inside the database.
import numpy as np

def normalize_i(I):
    # Rescale a 16-bit image to the 8-bit range, then map [minI, maxI] onto [minO, maxO]
    minI = 20   # np.min(I)
    maxI = 50   # np.max(I)
    minO = 0
    maxO = 255
    iI = (256.0 / 65536) * I  # 16-bit -> 8-bit scale
    # linear rescale; note the offset minO is added outside the scale factor
    io = (iI - minI) * ((maxO - minO) / (maxI - minI)) + minO
    return io
b2 = cPickle.loads(row.asDict()['b2'], encoding='bytes')  # unpickle the image bytes
pdf = pd.DataFrame(b2, columns=["x"])
dfb2 = spark.createDataFrame(pdf)
dfb2.select(normalize_i(col("x")))
As expected,
pd.DataFrame(b2, columns=["x"])
returns an error, since b2 is an array of arrays:
ValueError: Shape of passed values is (324, 324), indices imply (324, 1)
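The same mismatch can be reproduced with a much smaller stand-in array (a 3x3 matrix instead of the 324x324 image):

```python
import numpy as np
import pandas as pd

# Small stand-in for the 324x324 image matrix.
arr = np.zeros((3, 3))

try:
    pd.DataFrame(arr, columns=["x"])  # one name for three columns
    caught = None
except ValueError as err:
    caught = err  # e.g. "Shape of passed values is (3, 3), indices imply (3, 1)"
```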
How should I define the column names for my dataframe, and the input for my function?
Any comments would be much appreciated. Thank you.
Upvotes: 0
Views: 192
Reputation: 743
Can you please detail what pdf should look like?
If b2 is 324x324, I guess you should give it 324 column names:
columns = ['x'+str(i) for i in range(b2.shape[1])]
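Putting it together, a minimal sketch (using a random 324x324 matrix as a hypothetical stand-in for the unpickled b2 array):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the unpickled 324x324 image matrix.
b2 = np.random.rand(324, 324)

# One generated name per column, matching the matrix width.
columns = ['x' + str(i) for i in range(b2.shape[1])]
pdf = pd.DataFrame(b2, columns=columns)  # no ValueError: 324 names, 324 columns

# In Spark, the frame can then be distributed and the normalization applied
# column-wise with a Scalar pandas UDF, roughly like this (pyspark assumed):
#
#   from pyspark.sql.functions import pandas_udf
#
#   @pandas_udf('double')
#   def normalize_udf(s: pd.Series) -> pd.Series:
#       iI = (256.0 / 65536) * s
#       return (iI - 20) * ((255 - 0) / (50 - 20.0)) + 0
#
#   dfb2 = spark.createDataFrame(pdf)
#   dfb2.select(*[normalize_udf(c) for c in dfb2.columns])
```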
Upvotes: 0