zaoras
zaoras

Reputation: 71

Change year based on start and end date in dataframe

I had a column in data frame called startEndDate, example: '10.12-20.05.2019', divided those to columns start_date and end_date with same year, example: start_date '10.12.2019' and end_date '20.05.2019'. But year in this example is wrong, as it should be 2018 because start date cannot be after end date. How can I compare entire dataframe and replace values so it contains correct start_dates based on if statement(because some start dates should stay with year as 2019)?

Upvotes: 0

Views: 570

Answers (2)

Rawson
Rawson

Reputation: 2787

This method first splits up the date string to two dates and creates start and end date columns. Then it subtracts 1 year from the start date if it is greater than the end date.

import pandas as pd
import numpy as np

# mock data
df = pd.DataFrame({"dates": ["10.12-20.05.2019", "02.04-31.10.2019"]})

# split date string to two dates, convert to datetime and stack to columns
df[["start", "end"]] = np.vstack(
    df.dates.apply(lambda x: pd.to_datetime(
        [x.split("-")[0] + x[-5:],
         x.split("-")[1]], format="%d.%m.%Y")))

# subtract 1 year from start date if greater than end date
df["start"] = np.where(df["start"]>df["end"],
                       df["start"] - pd.DateOffset(years=1),
                       df["start"])

df
#              dates      start        end
#0  10.12-20.05.2019 2018-12-10 2019-05-20
#1  02.04-31.10.2019 2019-04-02 2019-10-31

Although I have used split here for the initial splitting of the string, as there will always be 5 characters before the hyphen, and the date will always be the last 5 (with the .), there is no need to use the split and instead that line could change to:

df[["start", "end"]] = np.vstack(
    df.dates.apply(lambda x: pd.to_datetime(
        [x[:5] + x[-5:],
         x[6:]], format="%d.%m.%Y")))

Upvotes: 1

ArchAngelPwn
ArchAngelPwn

Reputation: 3046

This will show you which rows the start_date is > than the end date

data = {
    'Start_Date' : ['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04'],
    'End_Date' : ['2020-02-01', '2019-01-02', '2019-01-03', '2020-01-05']
}
df = pd.DataFrame(data)
df['Start_Date'] = pd.to_datetime(df['Start_Date'], infer_datetime_format=True)
df['End_Date'] = pd.to_datetime(df['End_Date'], infer_datetime_format=True)
df['Check'] = np.where(df['Start_Date'] > df['End_Date'], 'Error', 'No Error')
df

Without seeing more of your data or your intended final data this is the best we will be able to do to help identify problems in the data.

Upvotes: 1

Related Questions