I want to know if a record has updated its dates in a pandas DataFrame. The DataFrame has several columns: for each value of A there are several values of B, each with a start date and an end date. Thanks to the timestamp we can tell whether a record is new or whether a previous one has been modified.
What I want to know is how to check whether a new record has a date range close to other records in its group (for example, the B1 group). If they have a similar date range, delete the previous record and keep only the newer, updated one; but if the record has no range in common with the others, treat it as a new record.
For example,
Input Dataframe:
A | B | Start | End | Timestamp |
---|---|---|---|---|
A1 | B1 | 2021-05-10 00:00:00 | 2021-05-27 00:00:00 | 2021-05-15 00:00:00 |
A1 | B1 | 2021-05-12 00:00:00 | 2021-05-30 00:00:00 | 2021-04-15 00:00:00 |
A1 | B1 | 2021-05-10 00:00:00 | 2021-05-12 00:00:00 | 2021-03-15 00:00:00 |
A1 | B2 | 2021-06-02 00:00:00 | 2021-06-04 00:00:00 | 2021-02-15 00:00:00 |
A2 | B3 | 2021-01-01 00:00:00 | 2022-01-01 00:00:00 | 2021-05-15 00:00:00 |
A2 | B3 | 2021-07-15 00:00:00 | 2021-08-15 00:00:00 | 2021-04-15 00:00:00 |
A2 | B4 | 2021-05-30 00:00:00 | 2021-06-15 00:00:00 | 2021-05-15 00:00:00 |
A2 | B4 | 2021-06-02 00:00:00 | 2021-06-17 00:00:00 | 2021-04-15 00:00:00 |
Expected Output:
A | B | Start | End | Timestamp |
---|---|---|---|---|
A1 | B1 | 2021-05-10 00:00:00 | 2021-05-27 00:00:00 | 2021-05-15 00:00:00 |
A1 | B2 | 2021-06-02 00:00:00 | 2021-06-04 00:00:00 | 2021-02-15 00:00:00 |
A2 | B3 | 2021-01-01 00:00:00 | 2022-01-01 00:00:00 | 2021-05-15 00:00:00 |
A2 | B3 | 2021-07-15 00:00:00 | 2021-08-15 00:00:00 | 2021-04-15 00:00:00 |
A2 | B4 | 2021-05-30 00:00:00 | 2021-06-15 00:00:00 | 2021-05-15 00:00:00 |
Thank you!
I'm not sure what you mean exactly with a 'close' date range, so this answer won't exactly match the output you listed in the question.
For demo purposes I've made a CSV file called `data.csv` with the data from your question:
A,B,Start,End,Timestamp
A1,B1,2021-05-10 00:00:00,2021-05-27 00:00:00,2021-05-15 00:00:00
A1,B1,2021-05-12 00:00:00,2021-05-30 00:00:00,2021-04-15 00:00:00
A1,B1,2021-05-10 00:00:00,2021-05-12 00:00:00,2021-03-15 00:00:00
A1,B2,2021-06-02 00:00:00,2021-06-04 00:00:00,2021-02-15 00:00:00
A2,B3,2021-01-01 00:00:00,2022-01-01 00:00:00,2021-05-15 00:00:00
A2,B3,2021-07-15 00:00:00,2021-08-15 00:00:00,2021-04-15 00:00:00
A2,B4,2021-05-30 00:00:00,2021-06-15 00:00:00,2021-05-15 00:00:00
A2,B4,2021-06-02 00:00:00,2021-06-17 00:00:00,2021-04-15 00:00:00
An approach could be to compare the time differences for every group in the `B` column. We'll start with a group you mentioned in your question, i.e. the rows where the `B` column equals `"B1"`:
import pandas as pd
df = pd.read_csv("data.csv")
dff = df[df["B"] == "B1"]
>>> dff
A B ... End Timestamp
0 A1 B1 ... 2021-05-27 00:00:00 2021-05-15 00:00:00
1 A1 B1 ... 2021-05-30 00:00:00 2021-04-15 00:00:00
2 A1 B1 ... 2021-05-12 00:00:00 2021-03-15 00:00:00
# Difference in number of days between start and end date
>>> (pd.to_datetime(dff.End) - pd.to_datetime(dff.Start)).dt.days
0 17
1 18
2 2
dtype: int64
# How does each time difference compare to the time difference in the first row
>>> (pd.to_datetime(dff.End) - pd.to_datetime(dff.Start)).dt.days.diff().fillna(0)
0 0.0
1 1.0
2 -16.0
dtype: float64
# Filter where the number of days difference compared to the first row is less than 7
>>> abs((pd.to_datetime(dff.End) - pd.to_datetime(dff.Start)).dt.days.diff().fillna(0)) < 7
0 True
1 True
2 False
dtype: bool
# Filter dff based on earlier condition
>>> dff[abs((pd.to_datetime(dff.End) - pd.to_datetime(dff.Start)).dt.days.diff().fillna(0)) < 7]
A B Start End Timestamp
0 A1 B1 2021-05-10 00:00:00 2021-05-27 00:00:00 2021-05-15 00:00:00
1 A1 B1 2021-05-12 00:00:00 2021-05-30 00:00:00 2021-04-15 00:00:00
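The repeated duration expression is easier to reuse if it's wrapped in a small helper. This is only a sketch of the condition used above; `duration_close` and `tol_days` are names of my own:

```python
import pandas as pd

def duration_close(group: pd.DataFrame, tol_days: int = 7) -> pd.Series:
    """True where the Start->End duration differs from the previous row's
    duration by fewer than tol_days days (the first row is always True)."""
    days = (pd.to_datetime(group["End"]) - pd.to_datetime(group["Start"])).dt.days
    return days.diff().fillna(0).abs() < tol_days
```

With this helper the filter above becomes `dff[duration_close(dff)]`.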
Above we've only compared one group of the `B` column. To apply the same filter to all groups, we can use a `groupby` on the `B` column, iterate through the groups, filter each group with the condition above, and concatenate the filtered groups back together:
df = pd.concat([
group[
abs(
(pd.to_datetime(group.End) - pd.to_datetime(group.Start))
.dt.days.diff()
.fillna(0)
)
< 7
]
for name, group in df.groupby("B")
])
>>> df
A B Start End Timestamp
0 A1 B1 2021-05-10 00:00:00 2021-05-27 00:00:00 2021-05-15 00:00:00
1 A1 B1 2021-05-12 00:00:00 2021-05-30 00:00:00 2021-04-15 00:00:00
3 A1 B2 2021-06-02 00:00:00 2021-06-04 00:00:00 2021-02-15 00:00:00
4 A2 B3 2021-01-01 00:00:00 2022-01-01 00:00:00 2021-05-15 00:00:00
6 A2 B4 2021-05-30 00:00:00 2021-06-15 00:00:00 2021-05-15 00:00:00
7 A2 B4 2021-06-02 00:00:00 2021-06-17 00:00:00 2021-04-15 00:00:00
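Another reading of "close" is literal overlap of the `[Start, End]` ranges: within each `(A, B)` group, drop any record whose range overlaps a record with a newer `Timestamp`. Here's a sketch of that interpretation (`drop_overlapping_older` is a name of my own); note that it still doesn't reproduce your expected output exactly either, since the contained B3 range would be dropped under this rule:

```python
import pandas as pd

def drop_overlapping_older(df: pd.DataFrame) -> pd.DataFrame:
    """Within each (A, B) group, keep only the newest record (by Timestamp)
    among records whose [Start, End] ranges overlap each other."""
    df = df.copy()
    for col in ("Start", "End", "Timestamp"):
        df[col] = pd.to_datetime(df[col])
    keep = []
    for _, group in df.groupby(["A", "B"]):
        kept = []  # rows already accepted for this group, newest first
        for _, row in group.sort_values("Timestamp", ascending=False).iterrows():
            # Drop the row if its range overlaps any newer, already-kept row
            if not any(row.Start <= k.End and k.Start <= row.End for k in kept):
                kept.append(row)
        keep.extend(r.name for r in kept)
    return df.loc[sorted(keep)]
```

You'd have to decide which interpretation (similar durations, similar endpoints, or overlapping ranges) actually matches your data's semantics.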
Adjust the degree of closeness according to your needs. I've used days as the unit here, but you could use a different one, such as seconds, microseconds, or nanoseconds. Look through the `Series` documentation for more examples.
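For instance, the same timedeltas expose a `dt.total_seconds()` accessor for finer-grained comparisons:

```python
import pandas as pd

# Durations from the first two B1 rows in the question
deltas = pd.to_datetime(pd.Series(["2021-05-27", "2021-05-30"])) - pd.to_datetime(
    pd.Series(["2021-05-10", "2021-05-12"])
)
print(deltas.dt.days)            # 17, 18
print(deltas.dt.total_seconds()) # 1468800.0, 1555200.0
```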