Reputation: 1159
I have a pyspark DataFrame and I want to get a specific column and iterate over its values. For example:
userId  itemId
1       2
2       2
3       7
4       10
I can get the userId column with df.userId, and for each userId in that column I want to apply a method. How can I achieve this?
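To make the intent concrete, this is roughly what I have in mind if the data were small enough to collect to the driver (process_user is just a placeholder for my method):

for row in df.select("userId").collect():
    process_user(row.userId)

but I suspect there is a more Spark-friendly way to do this.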
Upvotes: 3
Views: 2380
Reputation: 21766
Your question is not very specific about the type of function you want to apply, so I have created an example that adds an item description based on the value of itemId.
First let's import the relevant libraries and create the data:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
df = spark.createDataFrame([(1,2),(2,2),(3,7),(4,10)], ['userId', 'itemId'])
Secondly, create the function and convert it into a UDF that PySpark can use:
def item_description(itemId):
    # Look up a human-readable description for the given itemId
    items = {2: "iPhone 8",
             7: "Apple iMac",
             10: "iPad"}
    return items[itemId]

item_description_udf = udf(item_description, StringType())
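As a side note, the same thing can also be written with udf as a decorator (assuming a reasonably recent PySpark version), which skips the separate wrapping step:

@udf(returnType=StringType())
def item_description(itemId):
    items = {2: "iPhone 8", 7: "Apple iMac", 10: "iPad"}
    return items[itemId]

With the decorator, item_description itself becomes the column function, so the next step would call item_description(df.itemId) instead of item_description_udf(df.itemId).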
Finally, add a new column ItemDescription and populate it with the value returned by the item_description_udf function:
df = df.withColumn("ItemDescription", item_description_udf(df.itemId))
df.show()
This gives the following output:
+------+------+---------------+
|userId|itemId|ItemDescription|
+------+------+---------------+
| 1| 2| iPhone 8|
| 2| 2| iPhone 8|
| 3| 7| Apple iMac|
| 4| 10| iPad|
+------+------+---------------+
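One caveat: item_description raises a KeyError for any itemId that is missing from the dictionary, which will make the Spark task fail. If your data can contain such ids, a slightly more defensive variant (just a sketch, not required by the question) uses dict.get so unknown ids end up as null in the new column:

def item_description(itemId):
    items = {2: "iPhone 8",
             7: "Apple iMac",
             10: "iPad"}
    # dict.get returns None for unknown ids; Spark stores this as null
    return items.get(itemId)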
Upvotes: 4