Reputation: 11244
I have a dataframe with thousands of columns that I would like to pass to the greatest function without specifying the column names individually. How can I do that?
As an example, I have df with 3 columns, which I pass to greatest by naming each one (df.x, df.y, and so on).
>>> from pyspark.sql.functions import greatest
>>> df = sqlContext.createDataFrame([(1, 4, 3)], ['x', 'y', 'z'])
>>> df.select(greatest(df.x, df.y, df.z).alias('greatest')).show()
+--------+
|greatest|
+--------+
|       4|
+--------+
The example above has only 3 columns, but with thousands of columns it is impractical to name each one. A couple of things I tried didn't work. I am missing some crucial Python...
df.select(greatest(",".join(df.columns)).alias('greatest')).show()
ValueError: greatest should take at least two columns
df.select(greatest(",".join(df.columns),df[0]).alias('greatest')).show()
u"cannot resolve 'x,y,z' given input columns: [x, y, z];"
df.select(greatest([c for c in df.columns],df[0]).alias('greatest')).show()
Method col([class java.util.ArrayList]) does not exist
Upvotes: 0
Views: 1344
Reputation: 35229
greatest accepts a variable number of positional arguments*:
pyspark.sql.functions.greatest(*cols)
(this is why you can call greatest(df.x, df.y, df.z)), so just unpack the list of column names with *:
df = sqlContext.createDataFrame([(1, 4, 3)], ['x', 'y', 'z'])
df.select(greatest(*df.columns))
* Quoting the Python glossary, a positional argument is
... an argument that is not a keyword argument. Positional arguments can appear at the beginning of an argument list and/or be passed as elements of an iterable preceded by *. For example, 3 and 5 are both positional arguments in the following calls:
complex(3, 5)
complex(*(3, 5))
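To make the difference concrete without a Spark session, here is a minimal pure-Python sketch. The function greatest_of is a hypothetical stand-in, not the PySpark implementation; it only mimics a variadic signature like greatest(*cols) and an "at least two" check, so the error message is illustrative as well.

```python
def greatest_of(*cols):
    """Hypothetical stand-in for a variadic function such as greatest(*cols)."""
    if len(cols) < 2:
        raise ValueError("greatest_of should take at least two columns")
    return max(cols)


values = [1, 4, 3]
names = ["x", "y", "z"]

# Joining the names builds ONE string, so the callee would see one argument.
joined = ",".join(names)

# Passing the list itself also supplies a single positional argument.
try:
    greatest_of(values)
    single_arg_error = None
except ValueError as exc:
    single_arg_error = str(exc)

# Unpacking with * supplies one positional argument per element.
unpacked_result = greatest_of(*values)

print(joined)
print(single_arg_error)
print(unpacked_result)
```

The same unpacking applies to df.columns, which is an ordinary Python list of column names, so greatest(*df.columns) works however many columns there are.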
Upvotes: 1