Caroline Griffin
Caroline Griffin

Reputation: 23

Easy way to fix wrong year (y2k bug) using pandas

I have a database with a column named ['birth_date'], already converted string -> date using:

dataCopy.loc[:,'birth_date'] = dataCopy['birth_date'].astype('datetime64[ns]')

I also converted other columns my db has. So, as some of you know there is an issue with 2 digits year dates (mm/dd/yy or whatever) that when python sees a date like mm/dd/69 and below it assumes the year is 2069 and not 1969. Problem is I need to subtract this column with another column to pick the age my customer had when he canceled the service. Example: He was born in 1969 and canceled the service in 2019, so he was 53 years old. I already know how I can do it:

dataCopy['idade'] = (dataCopy['deleted_at'].dt.year - dataCopy['birth_date'].dt.year)

But first I need to fix the wrong years. Using format (y%m%d and variations) doesn't work. I mean, they work but they don't fix the wrong years. I am a beginner, already tried functions I saw here on Stack but I couldn't modify it to match my problem (plus I didn't understand it 100%). I appreciate any help.

Upvotes: 2

Views: 376

Answers (2)

AlexWach
AlexWach

Reputation: 602

slightly simpler than @Fahids solution but using the same conditional approach

import pandas as pd

# produce sample data
df = pd.DataFrame({'date': pd.to_datetime(['01-01-34', '01-01-66', '01-01-19', '01-01-20'], format='%m-%d-%y'), 'value': [1, 2, 3, 4]})
print(df)

output

        date  value
0 2034-01-01      1
1 2066-01-01      2
2 2019-01-01      3
3 2020-01-01      4


# define latest possible date the data can be from
cutoff_date = pd.to_datetime('01-01-2020')

# substract 100 years from date > cutoff date
df.loc[df.date > cutoff_date, 'date'] -= pd.DateOffset(years=100)

print(df)

output

        date  value
0 1934-01-01      1
1 1966-01-01      2
2 2019-01-01      3
3 2020-01-01      4

Upvotes: 1

Fahid Latheef A
Fahid Latheef A

Reputation: 21

You need to create a custom function and map it to the birth_date column.

You can decide a cutoff year (say 40) above which will be classified to 19th century and below which will be classified to 20th century. For example, 62 will be converted to year 1962, 32 will be converted to 2032.

The below code create the custom function that converts the datestring.

import pandas as pd
import datetime as dt


def custom_date_function(date_string: str) -> dt.date:
    """
    Convert date string to date object
    """
    # Note that the first 8 character is the date without the time
    # Selecting the first 8 character
    # And then splitting the string using '/' to year, month and date
    date_components = date_string[0:8].split('/')

    # All number greater than 40 will be changed to 19th century
    # Else to 20th century
    # You may change the cutoff from 40

    if int(date_components[2]) >= 40:
        year = 1900 + int(date_components[2])
    else:
        year = 2000 + int(date_components[2])

    return dt.date(year=year, month=int(date_components[0]), day=int(date_components[1]))

Once the custom function is created, you may use it on your birth_date column.

# Example Code of applying the custom function on birth_date DataFrame column

# Creating an example DataFrame with birth_date column
df_dict = {'birth_date': ['11/22/67', '03/23/69', '11/22/27']}
dataCopy = pd.DataFrame(df_dict)

# Applying the function on birth_date DataFrame column
out = dataCopy['birth_date'].apply(custom_date_function)
print(out)

There is a possibility that birth_date column is already a date object. In that case you need to convert it to string before applying the custom_date_function.

Upvotes: 1

Related Questions