Reputation: 1

PySpark: Generating a Unique IDs for Data with Non-uniform and Repeated IDs

I have a dataset that is as indicated below. The first column is a date column, the second column comprises an id that can apply to any user (within the same department) and a given user can have multiple of these ids. The 3rd column is specific to a particular user but every so often, this id can be changed in case a user forgets it. The last column comprises of names and individuals can have similar names. I want to harmonize columns 1 and 2 prior to analysis as indicated below.

Input Dataset

Date	Id2	id3	Name
2018-01-01	001	AA	Name1
2018-01-02	001	AB	Name1
2018-01-01	001	AC	Name1
2018-01-04	003	AA	Name1
2018-01-01	004	AA	Name1
2018-01-01	001	AD	Name2
2018-01-04	002	AE	Name3
2018-01-04	005	AG	Name1

Desired output:

Date	Id2	id3	Name
2018-01-01	001	AA	Name1
2018-01-02	001	AA	Name1
2018-01-01	001	AA	Name1
2018-01-04	001	AA	Name1
2018-01-01	001	AA	Name1
2018-01-01	001	AD	Name2
2018-01-04	002	AE	Name3
2018-01-04	005	AG	Name1

Upvotes: 0

Answers (2)

Paula Nkuma

Reputation: 21

so we have a table below and you are trying to map id2 column to just one id3 column because you said that they can be mapped to multiple ids, this would be my approach.

+---+----
|id2|id3|
+---+----
|001| AA| 
|001| AC| 
|003| BB|
|003| DE| 
+---+---

First i would create a flag column that flags 1 ID2 column that has multiple ID3 with this code.

import pyspark.sql.functions as F
from pyspark.sql import Window

flag_df = dataframe.withColumn(
"dedup_flag",
F.dense_rank().over(
    Window.partitionBy(
        'id2'
    ).orderBy(F.col('id3').desc())) # this would help you pick the highest or lowest value in id3
 )

which results in

+---+----+-----------+
|id2|id3|dedup_flag  |
+---+----+-----------+
|001| AA|           1|
|001| AC|           2|
|003| BB|           1|
|003| DE|           2|
+---+----+-----------+

Then all you need to do is filter where dedup_flag = 1 and just do a left join with your original dataframe and flag dataframe on id2 to pull in all id3's associated to a given id2 with this code below.

final_df = dataframe.drop('id3').join(
flag_df.filter(F.col('dedup_flag') == 1).select('id2','id3'),
on= 'id2',
how= 'left'
)

which results to

+---+----
|id2|id3|
+---+----
|001| AA| 
|001| AA| 
|003| BB|
|003| BB| 
+---+---

I hope this helped, acknowledge if this worked for you :) -Paula

Upvotes: 1

Paula Nkuma

Reputation: 21

before I can help I need to understand the question, so for each unique ID in column 2 generate a unique ID column?

Upvotes: 0

PySpark: Generating a Unique IDs for Data with Non-uniform and Repeated IDs

Answers (2)

Related Questions