The Data Scientist
The Data Scientist

Reputation: 1

PySpark: Generating a Unique IDs for Data with Non-uniform and Repeated IDs

I have a dataset that is as indicated below. The first column is a date column, the second column comprises an id that can apply to any user (within the same department) and a given user can have multiple of these ids. The 3rd column is specific to a particular user but every so often, this id can be changed in case a user forgets it. The last column comprises of names and individuals can have similar names. I want to harmonize columns 1 and 2 prior to analysis as indicated below.

Input Dataset

Date Id2 id3 Name
2018-01-01 001 AA Name1
2018-01-02 001 AB Name1
2018-01-01 001 AC Name1
2018-01-04 003 AA Name1
2018-01-01 004 AA Name1
2018-01-01 001 AD Name2
2018-01-04 002 AE Name3
2018-01-04 005 AG Name1

Desired output:

Date Id2 id3 Name
2018-01-01 001 AA Name1
2018-01-02 001 AA Name1
2018-01-01 001 AA Name1
2018-01-04 001 AA Name1
2018-01-01 001 AA Name1
2018-01-01 001 AD Name2
2018-01-04 002 AE Name3
2018-01-04 005 AG Name1

Upvotes: 0

Views: 130

Answers (2)

Paula Nkuma
Paula Nkuma

Reputation: 21

so we have a table below and you are trying to map id2 column to just one id3 column because you said that they can be mapped to multiple ids, this would be my approach.

+---+----
|id2|id3|
+---+----
|001| AA| 
|001| AC| 
|003| BB|
|003| DE| 
+---+---

First i would create a flag column that flags 1 ID2 column that has multiple ID3 with this code.

import pyspark.sql.functions as F
from pyspark.sql import Window

flag_df = dataframe.withColumn(
"dedup_flag",
F.dense_rank().over(
    Window.partitionBy(
        'id2'
    ).orderBy(F.col('id3').desc())) # this would help you pick the highest or lowest value in id3
 )

which results in

+---+----+-----------+
|id2|id3|dedup_flag  |
+---+----+-----------+
|001| AA|           1|
|001| AC|           2|
|003| BB|           1|
|003| DE|           2|
+---+----+-----------+

Then all you need to do is filter where dedup_flag = 1 and just do a left join with your original dataframe and flag dataframe on id2 to pull in all id3's associated to a given id2 with this code below.

final_df = dataframe.drop('id3').join(
flag_df.filter(F.col('dedup_flag') == 1).select('id2','id3'),
on= 'id2',
how= 'left'
)

which results to

+---+----
|id2|id3|
+---+----
|001| AA| 
|001| AA| 
|003| BB|
|003| BB| 
+---+---

I hope this helped, acknowledge if this worked for you :) -Paula

Upvotes: 1

Paula Nkuma
Paula Nkuma

Reputation: 21

before I can help I need to understand the question, so for each unique ID in column 2 generate a unique ID column?

Upvotes: 0

Related Questions