Reputation: 119
I have 2 dataframes, one from today and one from yesterday and I need help to create a logic to compare them and create a new dataframe with the following columns:
customer_id
-> Column used to compare
person_id
-> Column used to compare
type_person
-> If the type_person column is the same as the previous day keep it, if not update to the new status according to today.
anterior_type
-> If the type_person column was updated, put the previous status here.
update_date
-> Date the record was updated if there is a new type_person.
create_date
-> Date the record was created.
Dataframe today:
customer_id | person_id | type_person | insert_date |
---|---|---|---|
afabd2d2 | 4a5ae8a5-6682-467... | Online | 2022-03-03 |
afabd2d2 | 1be8d3e8-8075-438... | Online | 2022-03-03 |
afabd2d2 | 6912dadc-1692-4bd... | Online | 2022-03-03 |
afabd2d2 | e48cba37-113c-4bd... | Online | 2022-03-03 |
afabd2d2 | 831cb669-b2ae-4e8... | Online | 2022-03-03 |
afabd2d2 | 69161fe5-62ac-400... | Hybrid | 2022-03-03 |
afabd2d2 | b48b59a0-92eb-410... | Hybrid | 2022-03-03 |
Dataframe yesterday:
customer_id | person_id | type_person | insert_date |
---|---|---|---|
afabd2d2 | 4a5ae8a5-6682-467... | Online | 2022-03-02 |
afabd2d2 | 1be8d3e8-8075-438... | Online | 2022-03-02 |
afabd2d2 | 6912dadc-1692-4bd... | Online | 2022-03-02 |
afabd2d2 | e48cba37-113c-4bd... | Online | 2022-03-02 |
afabd2d2 | 831cb669-b2ae-4e8... | Online | 2022-03-02 |
afabd2d2 | 69161fe5-62ac-400... | Online | 2022-03-02 |
afabd2d2 | b48b59a0-92eb-410... | Online | 2022-03-02 |
In the case of today's dataframe I have 2 lines that changed to Hybrid, so they would have to have update_date
and anterior_type
, the others remain the same and with the other null fields in this dataframe.
Upvotes: 0
Views: 74
Reputation: 9308
I tweak the sample data to demonstrate the new record scenario. I added the last line.
customer_id | person_id | type_person | insert_date |
---|---|---|---|
afabd2d2 | 4a5ae8a5-6682-467... | Online | 2022-03-03 |
afabd2d2 | 1be8d3e8-8075-438... | Online | 2022-03-03 |
afabd2d2 | 6912dadc-1692-4bd... | Online | 2022-03-03 |
afabd2d2 | e48cba37-113c-4bd... | Online | 2022-03-03 |
afabd2d2 | 831cb669-b2ae-4e8... | Online | 2022-03-03 |
afabd2d2 | 69161fe5-62ac-400... | Hybrid | 2022-03-03 |
afabd2d2 | b48b59a0-92eb-410... | Hybrid | 2022-03-03 |
afabd2d2 | xxxxxxxx-xxxx-xxx... | Online | 2022-03-03 |
To compare the difference, you can first join the 2 dataframes.
# dft: today's dataframe, dfy: yesterday's dataframe
# If you also want to track the deletion, change how from left to outer.
df = dft.join(dfy, on=['customer_id', 'person_id'], how='left')
Then, use when
to do the comparison.
df = (df.withColumn('anterior_type', (F.when(dfy.type_person.isNull(), 'New')
.when(dfy.type_person != dft.type_person, dfy.type_person)))
.withColumn('update_date', F.when(dfy.type_person.isNull() | (dfy.type_person != dft.type_person), dft.insert_date))
Result
|customer_id| person_id|type_person|insert_date|type_person|insert_date|anterior_type|update_date|
|-----------+-----------+-----------+-----------+-----------+-----------|-------------+-----------|
| afabd2d2|4a5ae8a5...| Online| 2022-03-03| Online| 2022-03-02| null| null|
| afabd2d2|1be8d3e8...| Online| 2022-03-03| Online| 2022-03-02| null| null|
| afabd2d2|6912dadc...| Online| 2022-03-03| Online| 2022-03-02| null| null|
| afabd2d2|e48cba37...| Online| 2022-03-03| Online| 2022-03-02| null| null|
| afabd2d2|831cb669...| Online| 2022-03-03| Online| 2022-03-02| null| null|
| afabd2d2|69161fe5...| Hybrid| 2022-03-03| Online| 2022-03-02| Online| 2022-03-03|
| afabd2d2|b48b59a0...| Hybrid| 2022-03-03| Online| 2022-03-02| Online| 2022-03-03|
| afabd2d2|xxxxxxxx...| Online| 2022-03-03| null| null| New| 2022-03-03|
If you need a value other than null
for anterior_type
or update_date
, you can chain otherwise
after then when
to add the value.
Upvotes: 1