Create pyspark logic to populate columns

Question

I have 2 dataframes, one from today and one from yesterday and I need help to create a logic to compare them and create a new dataframe with the following columns:

customer_id -> Column used to compare

person_id -> Column used to compare

type_person -> If the type_person column is the same as the previous day keep it, if not update to the new status according to today.

anterior_type -> If the type_person column was updated, put the previous status here.

update_date -> Date the record was updated if there is a new type_person.

create_date -> Date the record was created.

Dataframe today:

customer_id	person_id	type_person	insert_date
afabd2d2	4a5ae8a5-6682-467...	Online	2022-03-03
afabd2d2	1be8d3e8-8075-438...	Online	2022-03-03
afabd2d2	6912dadc-1692-4bd...	Online	2022-03-03
afabd2d2	e48cba37-113c-4bd...	Online	2022-03-03
afabd2d2	831cb669-b2ae-4e8...	Online	2022-03-03
afabd2d2	69161fe5-62ac-400...	Hybrid	2022-03-03
afabd2d2	b48b59a0-92eb-410...	Hybrid	2022-03-03

Dataframe yesterday:

customer_id	person_id	type_person	insert_date
afabd2d2	4a5ae8a5-6682-467...	Online	2022-03-02
afabd2d2	1be8d3e8-8075-438...	Online	2022-03-02
afabd2d2	6912dadc-1692-4bd...	Online	2022-03-02
afabd2d2	e48cba37-113c-4bd...	Online	2022-03-02
afabd2d2	831cb669-b2ae-4e8...	Online	2022-03-02
afabd2d2	69161fe5-62ac-400...	Online	2022-03-02
afabd2d2	b48b59a0-92eb-410...	Online	2022-03-02

In the case of today's dataframe I have 2 lines that changed to Hybrid, so they would have to have update_date and anterior_type, the others remain the same and with the other null fields in this dataframe.

Emma · Accepted Answer

I tweak the sample data to demonstrate the new record scenario. I added the last line.

customer_id	person_id	type_person	insert_date
afabd2d2	4a5ae8a5-6682-467...	Online	2022-03-03
afabd2d2	1be8d3e8-8075-438...	Online	2022-03-03
afabd2d2	6912dadc-1692-4bd...	Online	2022-03-03
afabd2d2	e48cba37-113c-4bd...	Online	2022-03-03
afabd2d2	831cb669-b2ae-4e8...	Online	2022-03-03
afabd2d2	69161fe5-62ac-400...	Hybrid	2022-03-03
afabd2d2	b48b59a0-92eb-410...	Hybrid	2022-03-03
afabd2d2	xxxxxxxx-xxxx-xxx...	Online	2022-03-03

To compare the difference, you can first join the 2 dataframes.

# dft: today's dataframe, dfy: yesterday's dataframe
# If you also want to track the deletion, change how from left to outer.
df = dft.join(dfy, on=['customer_id', 'person_id'], how='left')

Then, use when to do the comparison.

df = (df.withColumn('anterior_type', (F.when(dfy.type_person.isNull(), 'New')
                                       .when(dfy.type_person != dft.type_person, dfy.type_person)))
        .withColumn('update_date', F.when(dfy.type_person.isNull() | (dfy.type_person != dft.type_person), dft.insert_date))

Result

|customer_id|  person_id|type_person|insert_date|type_person|insert_date|anterior_type|update_date|
|-----------+-----------+-----------+-----------+-----------+-----------|-------------+-----------|
|   afabd2d2|4a5ae8a5...|     Online| 2022-03-03|     Online| 2022-03-02|         null|       null|
|   afabd2d2|1be8d3e8...|     Online| 2022-03-03|     Online| 2022-03-02|         null|       null|
|   afabd2d2|6912dadc...|     Online| 2022-03-03|     Online| 2022-03-02|         null|       null|
|   afabd2d2|e48cba37...|     Online| 2022-03-03|     Online| 2022-03-02|         null|       null|
|   afabd2d2|831cb669...|     Online| 2022-03-03|     Online| 2022-03-02|         null|       null|
|   afabd2d2|69161fe5...|     Hybrid| 2022-03-03|     Online| 2022-03-02|       Online| 2022-03-03|
|   afabd2d2|b48b59a0...|     Hybrid| 2022-03-03|     Online| 2022-03-02|       Online| 2022-03-03|
|   afabd2d2|xxxxxxxx...|     Online| 2022-03-03|       null|       null|          New| 2022-03-03|

If you need a value other than null for anterior_type or update_date, you can chain otherwise after then when to add the value.

Create pyspark logic to populate columns

Answers (1)

Related Questions