anika

Reputation: 1

How to apply pandas udf to large matrix dataframe

I am really new to Spark and pandas. I would like to apply a pandas UDF to a large matrix (a numpy.ndarray) that doesn't have any column names. How should I define the input for the UDF?

This is what I did. row is a row from the Cassandra database and 'b2' is the column name for an image inside the database.

def normalize_i(I):
    # rescale 16-bit intensities into the 0-255 output range
    minI = 20   # np.min(I)
    maxI = 50   # np.max(I)
    minO = 0
    maxO = 255
    iI = (256.0 / 65536) * I
    io = (iI - minI) * (((maxO - minO) / (maxI - minI)) + minO)
    return io

import pandas as pd
from pyspark.sql.functions import col

b2 = cPickle.loads(row.asDict()['b2'], encoding='bytes')
pdf = pd.DataFrame(b2, columns=["x"])
dfb2 = spark.createDataFrame(pdf)
dfb2.select(normalize_i(col("x")))

As expected, pd.DataFrame(b2, columns=["x"]) raises an error, since b2 is an array of arrays: ValueError: Shape of passed values is (324, 324), indices imply (324, 1)
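The mismatch can be reproduced with a small array (a sketch using a 4x4 stand-in for the real 324x324 image):

```python
import numpy as np
import pandas as pd

b2 = np.zeros((4, 4))  # small stand-in for the 324x324 image

try:
    pd.DataFrame(b2, columns=["x"])  # one name for four columns -> error
except ValueError as e:
    print(e)

# One name per column works:
df = pd.DataFrame(b2, columns=['x' + str(i) for i in range(b2.shape[1])])
print(df.shape)  # (4, 4)
```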

How should I define the column names for my dataframe, and the input for my function?

Any comment would be much appreciated. Thank you

Upvotes: 0

Views: 192

Answers (1)

Can you please detail what pdf should look like?

If b2 is 324x324, I guess you should give it 324 column names:

columns = ['x'+str(i) for i in range(b2.shape[1])]
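A sketch of how the column naming fits into the rest of the pipeline. The Spark lines are left as comments since they need a live SparkSession, and b2 here is random stand-in data rather than the real unpickled image:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the unpickled 324x324 image
b2 = np.random.randint(0, 65536, size=(324, 324))

# One generated name per column, as suggested above
columns = ['x' + str(i) for i in range(b2.shape[1])]
pdf = pd.DataFrame(b2, columns=columns)
print(pdf.shape)  # (324, 324)

# Spark part, sketched (requires a SparkSession named `spark`):
# from pyspark.sql.functions import pandas_udf, col
# from pyspark.sql.types import DoubleType
# dfb2 = spark.createDataFrame(pdf)
# norm = pandas_udf(normalize_i, returnType=DoubleType())
# dfb2.select(*[norm(col(c)) for c in dfb2.columns])
```

Since normalize_i is purely elementwise, the same pandas UDF can be applied to every generated column in one select, as the commented lines show.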

Upvotes: 0
