Reputation: 181
I have collected the data of a DataFrame column in Spark:
temp = df.select('item_code').collect()
Result:
[Row(item_code=u'I0938'),
Row(item_code=u'I0009'),
Row(item_code=u'I0010'),
Row(item_code=u'I0010'),
Row(item_code=u'C0723'),
Row(item_code=u'I1097'),
Row(item_code=u'C0117'),
Row(item_code=u'I0009'),
Row(item_code=u'I0009'),
Row(item_code=u'I0009'),
Row(item_code=u'I0010'),
Row(item_code=u'I0009'),
Row(item_code=u'C0117'),
Row(item_code=u'I0009'),
Row(item_code=u'I0596')]
Now I would like to assign a number to each value; if a value is duplicated, it should get the same number. I am using Spark (RDDs/DataFrames), not Pandas.
Please help me resolve this problem!
Upvotes: 0
Views: 437
Reputation: 2468
You could create a new DataFrame which contains only the distinct values:
val data = temp.distinct()
Now you can assign a unique ID using
import org.apache.spark.sql.functions._
val dataWithId = data.withColumn("uniqueID", monotonically_increasing_id())
Now you can join this new DataFrame with the original one and select the unique ID:
val tempWithId = temp.join(dataWithId, "item_code").select("item_code", "uniqueID")
The code assumes Scala, but something similar exists in PySpark as well. Just consider this a pointer.
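For reference, a minimal PySpark sketch of the same approach, assuming `df` is the original DataFrame from the question, could look like this (note that `monotonically_increasing_id` produces unique but not necessarily consecutive numbers):

from pyspark.sql.functions import monotonically_increasing_id

# Distinct item codes, each tagged with a unique ID.
data_with_id = (df.select("item_code")
                  .distinct()
                  .withColumn("uniqueID", monotonically_increasing_id()))

# Join back so every row carries the ID of its item code;
# duplicate item codes share the same ID.
temp_with_id = df.join(data_with_id, "item_code").select("item_code", "uniqueID")
temp_with_id.show()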
Upvotes: 1