taurz

Reputation: 194

Create a unique id for each combination of values from two columns in a Spark dataframe

I have a Spark dataframe with six columns, say (col1, col2, ..., col6). I want to create a unique id for each combination of values from "col1" and "col2" and add it to the dataframe. Can someone help me with some PySpark code to do this?

Upvotes: 0

Views: 5456

Answers (3)

Hrushikesh V

Reputation: 21

You can use the rank function.

rdd=sc.parallelize([[12,23,3,4,5,6],[12,23,56,67,89,20],[12,23,0,0,0,0],[12,2,12,12,12,23],[1,2,3,4,56,7],[1,2,3,4,56,7]])
df = rdd.toDF(['col_1','col_2','col_3','col_4','col_5','col_6'])
df.show()
+-----+-----+-----+-----+-----+-----+
|col_1|col_2|col_3|col_4|col_5|col_6|
+-----+-----+-----+-----+-----+-----+
|   12|   23|    3|    4|    5|    6|
|   12|   23|   56|   67|   89|   20|
|   12|   23|    0|    0|    0|    0|
|   12|    2|   12|   12|   12|   23|
|    1|    2|    3|    4|   56|    7|
|    1|    2|    3|    4|   56|    7|
+-----+-----+-----+-----+-----+-----+

To generate the unique id, pass the list of columns to the orderBy function of a Window:

from pyspark.sql.functions import rank
from pyspark.sql.window import Window

df.withColumn("uid", rank().over(Window.orderBy("col_1", "col_2"))).show()

+-----+-----+-----+-----+-----+-----+---+
|col_1|col_2|col_3|col_4|col_5|col_6|uid|
+-----+-----+-----+-----+-----+-----+---+
|1    |2    |3    |4    |56   |7    |1  |
|1    |2    |3    |4    |56   |7    |1  |
|12   |2    |12   |12   |12   |23   |3  |
|12   |23   |56   |67   |89   |20   |4  |
|12   |23   |3    |4    |5    |6    |4  |
|12   |23   |0    |0    |0    |0    |4  |
+-----+-----+-----+-----+-----+-----+---+
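
Note that rank leaves gaps after ties (uid 2 is skipped above), while rows with the same (col_1, col_2) pair share an id. If you want consecutive ids with no gaps, dense_rank is a drop-in alternative (a small variation on the answer above, not part of it):

from pyspark.sql.functions import dense_rank
from pyspark.sql.window import Window

# dense_rank: equal (col_1, col_2) pairs share an id and no values
# are skipped after ties, so the ids run 1, 2, 3, ...
df.withColumn("uid", dense_rank().over(Window.orderBy("col_1", "col_2"))).show()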

Upvotes: 1

Jim Todd

Reputation: 1588

You can achieve it using monotonically_increasing_id (PySpark 1.6 and later) or monotonicallyIncreasingId (older versions). The generated ids are unique but not consecutive, since the values depend on how the data is partitioned.

>>> from pyspark.sql.functions import monotonically_increasing_id
>>> rdd=sc.parallelize([[12,23,3,4,5,6],[12,23,56,67,89,20],[12,23,0,0,0,0],[12,2,12,12,12,23],[1,2,3,4,56,7],[1,2,3,4,56,7]])
>>> df = rdd.toDF(['col_1','col_2','col_3','col_4','col_5','col_6'])
>>> df.show()
+-----+-----+-----+-----+-----+-----+
|col_1|col_2|col_3|col_4|col_5|col_6|
+-----+-----+-----+-----+-----+-----+
|   12|   23|    3|    4|    5|    6|
|   12|   23|   56|   67|   89|   20|
|   12|   23|    0|    0|    0|    0|
|   12|    2|   12|   12|   12|   23|
|    1|    2|    3|    4|   56|    7|
|    1|    2|    3|    4|   56|    7|
+-----+-----+-----+-----+-----+-----+

>>> df_1=df.groupBy(df.col_1,df.col_2).count().withColumn("id", monotonically_increasing_id()).select(['col_1','col_2','id'])
>>> df_1.show()
+-----+-----+-------------+
|col_1|col_2|           id|
+-----+-----+-------------+
|   12|   23|  34359738368|
|    1|    2|1434519076864|
|   12|    2|1554778161152|
+-----+-----+-------------+

>>> df.join(df_1,(df.col_1==df_1.col_1) & (df.col_2==df_1.col_2)).drop(df_1.col_1).drop(df_1.col_2).show()
+-----+-----+-----+-----+-----+-----+-------------+
|col_3|col_4|col_5|col_6|col_1|col_2|           id|
+-----+-----+-----+-----+-----+-----+-------------+
|    3|    4|    5|    6|   12|   23|  34359738368|
|   56|   67|   89|   20|   12|   23|  34359738368|
|    0|    0|    0|    0|   12|   23|  34359738368|
|    3|    4|   56|    7|    1|    2|1434519076864|
|    3|    4|   56|    7|    1|    2|1434519076864|
|   12|   12|   12|   23|   12|    2|1554778161152|
+-----+-----+-----+-----+-----+-----+-------------+
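
A variation on the final join (my sketch, not part of the original answer): passing the join keys as a list of column names makes Spark keep a single copy of col_1 and col_2, so the two drop calls become unnecessary:

>>> # equi-join on the column names themselves; no duplicate columns remain
>>> df.join(df_1, on=['col_1', 'col_2']).show()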

Upvotes: 4

abiratsis

Reputation: 7316

If you really need to generate the unique id from col1 and col2, you can also create a hash value by leveraging Spark's sha2 function.

First, let's generate some dummy data:

from random import randint

max_range = 10
df1 = spark.createDataFrame(
            [(x, x * randint(1, max_range), x * 10 * randint(1, max_range)) for x in range(1, max_range)], 
            ['C1', 'C2', 'C3'])

>>> df1.show()
+---+---+---+
| C1| C2| C3|
+---+---+---+
|  1|  1| 60|
|  2| 14|180|
|  3| 21|270|
|  4| 16|360|
|  5| 35|250|
|  6| 30|480|
|  7| 28|210|
|  8| 80|320|
|  9| 45|360|
+---+---+---+

Then create a new uid column from columns C2 and C3 with the following code:

from pyspark.sql.functions import col, sha2, concat

df1.withColumn("uid", sha2(concat(col("C2"), col("C3")), 256)).show(10, False)

And the output:

+---+---+---+--------------------+
| C1| C2| C3|                 uid|
+---+---+---+--------------------+
|  1|  1| 60|a512db2741cd20693...|
|  2| 14|180|2f6543dc6c0e06e4a...|
|  3| 21|270|bd3c65ddde4c6f733...|
|  4| 16|360|c7a1e8c59fc9dcc21...|
|  5| 35|250|cba1aeb7a72d9ae27...|
|  6| 30|480|ad7352ff8927cf790...|
|  7| 28|210|ea7bc25aa7cd3503f...|
|  8| 80|320|02e1d953517339552...|
|  9| 45|360|b485cf8f710a65755...|
+---+---+---+--------------------+
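
One caveat (my note, not part of the original answer): concat joins the values with no separator, so pairs like (1, 23) and (12, 3) both become "123" and hash to the same uid. Adding a delimiter with concat_ws avoids the ambiguity:

from pyspark.sql.functions import col, sha2, concat_ws

# concat_ws inserts "|" between the values, so (1, 23) -> "1|23" and
# (12, 3) -> "12|3" produce different hashes
df1.withColumn("uid", sha2(concat_ws("|", col("C2"), col("C3")), 256)).show(10, False)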

Upvotes: 4
