Reputation: 1
I am really new to Spark and Pandas. I would like to apply a pandas UDF to a large matrix of numpy.ndarray that doesn't have any column names. How should I define the input for the UDF function?
This is what I did. row is a row from the Cassandra database and 'b2' is the column name for an image inside the database.
import numpy as np

def normalize_i(I):
    # Rescale a 16-bit image to the 8-bit range, then map [minI, maxI] onto [minO, maxO]
    minI = 20   # np.min(I)
    maxI = 50   # np.max(I)
    minO = 0
    maxO = 255
    iI = (256.0 / 65536) * I  # 16-bit -> 8-bit scale
    # linear rescale; note the offset minO is added outside the scale factor
    io = (iI - minI) * ((maxO - minO) / (maxI - minI)) + minO
    return io
b2 = cPickle.loads(row.asDict()['b2'], encoding='bytes')  # unpickle the image bytes
pdf = pd.DataFrame(b2, columns=["x"])
dfb2 = spark.createDataFrame(pdf)
dfb2.select(normalize_i(col("x")))
As expected,
pd.DataFrame(b2, columns=["x"])
returns an error, since b2 is an array of arrays:
ValueError: Shape of passed values is (324, 324), indices imply (324, 1)
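The same mismatch can be reproduced with a much smaller stand-in array (a 3x3 matrix instead of the 324x324 image):

```python
import numpy as np
import pandas as pd

# Small stand-in for the 324x324 image matrix.
arr = np.zeros((3, 3))

try:
    pd.DataFrame(arr, columns=["x"])  # one name for three columns
    caught = None
except ValueError as err:
    caught = err  # e.g. "Shape of passed values is (3, 3), indices imply (3, 1)"
```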
How should I define the column names for my dataframe, and the input for my function?
Any comments would be much appreciated. Thank you.
Upvotes: 0
Views: 192
Reputation: 743
Can you please detail what pdf should look like?
If b2 is 324x324, I guess you should give it 324 column names:
columns = ['x'+str(i) for i in range(b2.shape[1])]
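Putting it together, a minimal sketch (using a random 324x324 matrix as a hypothetical stand-in for the unpickled b2 array):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the unpickled 324x324 image matrix.
b2 = np.random.rand(324, 324)

# One generated name per column, matching the matrix width.
columns = ['x' + str(i) for i in range(b2.shape[1])]
pdf = pd.DataFrame(b2, columns=columns)  # no ValueError: 324 names, 324 columns

# In Spark, the frame can then be distributed and the normalization applied
# column-wise with a Scalar pandas UDF, roughly like this (pyspark assumed):
#
#   from pyspark.sql.functions import pandas_udf
#
#   @pandas_udf('double')
#   def normalize_udf(s: pd.Series) -> pd.Series:
#       iI = (256.0 / 65536) * s
#       return (iI - 20) * ((255 - 0) / (50 - 20.0)) + 0
#
#   dfb2 = spark.createDataFrame(pdf)
#   dfb2.select(*[normalize_udf(c) for c in dfb2.columns])
```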
Upvotes: 0