Reputation: 194
I have a Spark dataframe with six columns, say (col1, col2, ..., col6). I want to create a unique id for each combination of values from "col1" and "col2" and add it to the dataframe. Can someone help me with some PySpark code showing how to do it?
Upvotes: 0
Views: 5456
Reputation: 21
You can use the rank function.
rdd=sc.parallelize([[12,23,3,4,5,6],[12,23,56,67,89,20],[12,23,0,0,0,0],[12,2,12,12,12,23],[1,2,3,4,56,7],[1,2,3,4,56,7]])
df = rdd.toDF(['col_1','col_2','col_3','col_4','col_5','col_6'])
df.show()
+-----+-----+-----+-----+-----+-----+
|col_1|col_2|col_3|col_4|col_5|col_6|
+-----+-----+-----+-----+-----+-----+
| 12| 23| 3| 4| 5| 6|
| 12| 23| 56| 67| 89| 20|
| 12| 23| 0| 0| 0| 0|
| 12| 2| 12| 12| 12| 23|
| 1| 2| 3| 4| 56| 7|
| 1| 2| 3| 4| 56| 7|
+-----+-----+-----+-----+-----+-----+
To generate the unique id, pass the columns to the orderBy function of Window. Every row with the same (col_1, col_2) combination receives the same rank, although the ids are not consecutive because rank leaves gaps after ties:
from pyspark.sql.functions import rank
from pyspark.sql.window import Window

df.withColumn("uid", rank().over(Window.orderBy("col_1", "col_2"))).show()
+-----+-----+-----+-----+-----+-----+---+
|col_1|col_2|col_3|col_4|col_5|col_6|uid|
+-----+-----+-----+-----+-----+-----+---+
|    1|    2|    3|    4|   56|    7|  1|
|    1|    2|    3|    4|   56|    7|  1|
|   12|    2|   12|   12|   12|   23|  3|
|   12|   23|   56|   67|   89|   20|  4|
|   12|   23|    3|    4|    5|    6|  4|
|   12|   23|    0|    0|    0|    0|  4|
+-----+-----+-----+-----+-----+-----+---+
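If you need the ids to be consecutive (1, 2, 3, ...) rather than skipping values after ties, a minimal variant of the same window is to swap rank for dense_rank. This is just a sketch of that substitution, with the caveat that a Window with no partitionBy moves all rows into a single partition, so it will not scale to large dataframes:
from pyspark.sql.functions import dense_rank
from pyspark.sql.window import Window

# dense_rank assigns consecutive ids per (col_1, col_2) combination,
# where rank leaves gaps after ties (1, 1, 3, 4, 4, 4 above)
df.withColumn("uid", dense_rank().over(Window.orderBy("col_1", "col_2"))).show()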
Upvotes: 1
Reputation: 1588
You can achieve it using monotonically_increasing_id (PySpark ≥ 1.6) or monotonicallyIncreasingId (PySpark < 1.6):
>>> from pyspark.sql.functions import monotonically_increasing_id
>>> rdd=sc.parallelize([[12,23,3,4,5,6],[12,23,56,67,89,20],[12,23,0,0,0,0],[12,2,12,12,12,23],[1,2,3,4,56,7],[1,2,3,4,56,7]])
>>> df = rdd.toDF(['col_1','col_2','col_3','col_4','col_5','col_6'])
>>> df.show()
+-----+-----+-----+-----+-----+-----+
|col_1|col_2|col_3|col_4|col_5|col_6|
+-----+-----+-----+-----+-----+-----+
| 12| 23| 3| 4| 5| 6|
| 12| 23| 56| 67| 89| 20|
| 12| 23| 0| 0| 0| 0|
| 12| 2| 12| 12| 12| 23|
| 1| 2| 3| 4| 56| 7|
| 1| 2| 3| 4| 56| 7|
+-----+-----+-----+-----+-----+-----+
>>> df_1 = df.groupBy(df.col_1, df.col_2).count().withColumn("id", monotonically_increasing_id()).select(['col_1', 'col_2', 'id'])
>>> df_1.show()
+-----+-----+-------------+
|col_1|col_2| id|
+-----+-----+-------------+
| 12| 23| 34359738368|
| 1| 2|1434519076864|
| 12| 2|1554778161152|
+-----+-----+-------------+
>>> df.join(df_1, (df.col_1 == df_1.col_1) & (df.col_2 == df_1.col_2)).drop(df_1.col_1).drop(df_1.col_2).show()
+-----+-----+-----+-----+-----+-----+-------------+
|col_3|col_4|col_5|col_6|col_1|col_2| id|
+-----+-----+-----+-----+-----+-----+-------------+
| 3| 4| 5| 6| 12| 23| 34359738368|
| 56| 67| 89| 20| 12| 23| 34359738368|
| 0| 0| 0| 0| 12| 23| 34359738368|
| 3| 4| 56| 7| 1| 2|1434519076864|
| 3| 4| 56| 7| 1| 2|1434519076864|
| 12| 12| 12| 23| 12| 2|1554778161152|
+-----+-----+-----+-----+-----+-----+-------------+
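As a sketch of the same idea with slightly less plumbing (not part of the original answer): build the key table with distinct() instead of groupBy().count(), and join on the column names so Spark keeps a single copy of the key columns. The ids are unique but not consecutive, and they change from run to run because monotonically_increasing_id encodes the partition layout:
from pyspark.sql.functions import monotonically_increasing_id

# one row per (col_1, col_2) combination, each tagged with a generated id
df_keys = (df.select('col_1', 'col_2')
             .distinct()
             .withColumn('id', monotonically_increasing_id()))

# joining on a list of column names deduplicates the key columns automatically
df.join(df_keys, on=['col_1', 'col_2']).show()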
Upvotes: 4
Reputation: 7316
If you really need to generate the unique id from col1 and col2, you can also create a hash value with Spark's sha2 function.
First, let's generate some dummy data:
from random import randint
max_range = 10
df1 = spark.createDataFrame(
    [(x, x * randint(1, max_range), x * 10 * randint(1, max_range)) for x in range(1, max_range)],
    ['C1', 'C2', 'C3'])
>>> df1.show()
+---+---+---+
| C1| C2| C3|
+---+---+---+
| 1| 1| 60|
| 2| 14|180|
| 3| 21|270|
| 4| 16|360|
| 5| 35|250|
| 6| 30|480|
| 7| 28|210|
| 8| 80|320|
| 9| 45|360|
+---+---+---+
Then create a new uid column from columns C2 and C3 with the following code:
from pyspark.sql.functions import col, sha2, concat
df1.withColumn("uid", sha2(concat(col("C2"), col("C3")), 256)).show(10, False)
And the output:
+---+---+---+--------------------+
| C1| C2| C3| uid|
+---+---+---+--------------------+
| 1| 1| 60|a512db2741cd20693...|
| 2| 14|180|2f6543dc6c0e06e4a...|
| 3| 21|270|bd3c65ddde4c6f733...|
| 4| 16|360|c7a1e8c59fc9dcc21...|
| 5| 35|250|cba1aeb7a72d9ae27...|
| 6| 30|480|ad7352ff8927cf790...|
| 7| 28|210|ea7bc25aa7cd3503f...|
| 8| 80|320|02e1d953517339552...|
| 9| 45|360|b485cf8f710a65755...|
+---+---+---+--------------------+
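One caveat: concat with no separator can collide, since (C2=1, C3=23) and (C2=12, C3=3) both hash the string "123". If that matters for your data, concat_ws inserts a delimiter between the values; a minimal sketch using "|" as the (arbitrary) separator:
from pyspark.sql.functions import concat_ws, sha2

# the separator keeps ("1", "23") and ("12", "3") from producing the same digest
df1.withColumn("uid", sha2(concat_ws("|", "C2", "C3"), 256)).show()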
Upvotes: 4