Reputation: 2423
I currently have a pyspark dataframe, and one of the columns contains rows of numbers that I would like to look up using a function I wrote, which returns a string of information. I know the simple way would be to use withColumn and define a UDF to create a new column from the old one; however, something about the way my function works makes it impossible to register as a UDF. Is it possible for me to create a new dataframe with my new column, based on the values of the old column, without making a UDF?
Upvotes: 1
Views: 2000
Reputation: 3501
You could go from dataframe to rdd and then back to dataframe. For example, suppose you have a dataframe with two columns - 'col1' and 'col2':
df = sqlContext.createDataFrame([[1,2],[3,4],[5,6]],['col1','col2'])
df.show()
+----+----+
|col1|col2|
+----+----+
| 1| 2|
| 3| 4|
| 5| 6|
+----+----+
You could convert to an rdd, run it through a map, and return a tuple with 'col1', 'col2', and your new column - in this case 'col3' (gen_col_3 would be your function):
def gen_col_3(col1, col2):
    return col1 + col2

rdd = df.rdd.map(lambda x: (x['col1'], x['col2'], gen_col_3(x['col1'], x['col2'])))
Then you can convert back to a dataframe like so:
df = rdd.toDF(['col1','col2','col3'])
df.show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
| 1| 2| 3|
| 3| 4| 7|
| 5| 6| 11|
+----+----+----+
Upvotes: 1