oracletohive

Reputation: 15

Oracle to Spark/Hive: how to convert use of the "greatest" function to a Spark 1.6 DataFrame

The table in Oracle has 37 columns: year, month, d1, d2, ..., d34, and one more column called maxd, which is blank. The data in d1..d34 are all integers. For each row, I have to find the greatest value among d1, d2, ..., d34 and put it in the maxd column. Oracle code:

UPDATE documents set maxd = GREATEST(d1,d2,d3,d4,d5,d6,d7,d8,d9,d10,d11,d12,d13,d14,d15,d16,d17,d18,d19,d20,d21,d22,d23,d24,d25,d26,d27,d28,d29,d30,d31,d32,d33,d34);

I have created a DataFrame containing the exact same set of 37 columns in Spark 1.6. Now I need to write code that calculates the maximum value of d1, d2, ..., d34 for each row and stores it in the maxd column. Please help.

Upvotes: 1

Views: 451

Answers (1)

Bala

Reputation: 11274

This works in Spark 1.6:

>>> from pyspark.sql.functions import greatest
>>> df = sqlContext.createDataFrame([(10, 400, 30)], ['x', 'y', 'z'])

>>> df.show()
+---+---+---+
|  x|  y|  z|
+---+---+---+
| 10|400| 30|
+---+---+---+

>>> df.select(greatest(df.x, df.y, df.z).alias('greatest')).show()
+--------+
|greatest|
+--------+
|     400|
+--------+

Update

I have just learnt that you can pass greatest(*df.columns) instead of listing df.x, df.y, and so on. This is very useful when the number of columns is large. So the above could be rewritten as:

>>> df.select(greatest(*df.columns).alias('greatest')).show()
+--------+
|greatest|
+--------+
|     400|
+--------+
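For the table in the question, note that greatest(*df.columns) would also pull in year and month (and the blank maxd), so you would want to restrict it to the d-columns. A minimal sketch, assuming the d1..d34 column names from the question:

>>> from pyspark.sql.functions import greatest
>>> d_cols = ['d%d' % i for i in range(1, 35)]  # 'd1' .. 'd34'; skips year, month, maxd
>>> df = df.withColumn('maxd', greatest(*d_cols))  # replaces the blank maxd column

greatest accepts column names as strings, and withColumn replaces an existing column with the same name, so this fills maxd for every row in one pass.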

Upvotes: 0
