Tom
Tom

Reputation: 115

Pandas Datetime conversion CET/CEST to UTC

I have a df datetime column that I want to convert from Europe/Copenhgaen t.z to UTC but I just keep getting duplicate entries in the UTC column. The reason this happens is because of how I make my datetime column.

My data comes in as a df with the columns date (with a start time set to 00:00), position, resolution, ID and quantity. For a given date there are between 23 and 25 numerical positions, depending on the day, for each ID and for the sake of simplicity lets say resolution is just hourly and we have one ID.

I have a function that converts the date, resolution and position to a datetime, which I then set to the requisite time zone, from which I convert to UTC. However, I get the problem that the UTC date column contains a duplicate date rather than the required date, 2024-10-27 01:00

Example:

import pytz
import pandas as pd
import numpy as np
from datetime import datetime, timedelta

# Set up date range from 25th October 2024 to 28th October 2024
date_range = pd.date_range(start="2024-10-25", end="2024-10-28", freq="D")

# Initialize an empty list to collect data
data = []

# Populate data for each date
for date in date_range:
    # Set number of positions based on date
    if date == datetime(2024, 10, 27):
        positions = range(1, 26)  # 25 positions on 27th October
    else:
        positions = range(1, 25)  # 24 positions on other dates
    
    # Generate data entries for each position
    for pos in positions:
        data.append({
            "Date": date,
            "position": pos,
            "ID": "A",
            "quantity": np.random.randint(1, 100),
            "resolution": 'PT1H'
        })

# Create DataFrame
df = pd.DataFrame(data)

copenhagen_tz = pytz.timezone('Europe/Copenhagen')
df['Date'] = pd.to_datetime(df['Date'],utc=False)

# Define a dictionary to map resolution text to timedelta values
resolution_map = {
    'PT1H': timedelta(hours=1),
    'PT30M': timedelta(minutes=30),
    'PT15M': timedelta(minutes=15),
    'PT5M': timedelta(minutes=5)
}

# Calculate the exact datetime by applying the resolution offset
def calculate_exact_date(row):
    base_time = row['Date']
    offset = (row['position'] - 1) * resolution_map[row['resolution']]
    # Handle DST ambiguity by using `ambiguous=True` to interpret ambiguous times consistently
    exact_date = base_time + offset

    return copenhagen_tz.normalize(exact_date)

# Convert 'Date' to datetime in UTC, then localize to Europe/Copenhagen
df['Date'] = pd.to_datetime(df['Date']).dt.tz_localize('CET')

df['Date_Local'] = df.apply(calculate_exact_date, axis=1)
df['Date_UTC'] = pd.to_datetime(df['Date_Local']).dt.tz_convert('UTC') 

Any ideas how I can do this differently? I cannot change the format of the data I receive, it will alsways have a position and starting datetime.

Upvotes: -1

Views: 152

Answers (1)

Mark Tolonen
Mark Tolonen

Reputation: 178031

Due to CEST to CET time change, there are two 0200 times. If you do the time math in local time you end up with two UTC times that are the same. Do the math in UTC:

import pandas as pd
from datetime import datetime, timedelta
from zoneinfo import ZoneInfo  # built-in instead of 3rd party pytz.

zone = ZoneInfo('Europe/Copenhagen')
# Generate times in CEST/CET
date_range = pd.date_range(start='2024-10-27', end='2024-10-27', freq='D', tz=zone)

data = []
for date in date_range:
    for pos in range(1, 26):
        data.append({'Date': date, 'position': pos})

df = pd.DataFrame(data)

def calculate_exact_date(row):
    # Math in UTC
    base_time = row['Date']
    offset = (row['position'] - 1) * timedelta(hours=1)
    exact_date = base_time + offset
    # Return local time
    return exact_date.astimezone(zone)

# Convert to UTC
df['Date'] = pd.to_datetime(df['Date']).dt.tz_convert('UTC')
df['Date_Local'] = df.apply(calculate_exact_date, axis=1)
# Make the UTC-equivalent column
df['Date_UTC'] = pd.to_datetime(df['Date_Local']).dt.tz_convert('UTC')
del df['Date']  # To see Date_Local and Date_UTC in the default print below
print(df)

Output:

    position                Date_Local                  Date_UTC
0          1 2024-10-27 00:00:00+02:00 2024-10-26 22:00:00+00:00
1          2 2024-10-27 01:00:00+02:00 2024-10-26 23:00:00+00:00
2          3 2024-10-27 02:00:00+02:00 2024-10-27 00:00:00+00:00  # 0200 CEST
3          4 2024-10-27 02:00:00+01:00 2024-10-27 01:00:00+00:00  # 0200 CET
4          5 2024-10-27 03:00:00+01:00 2024-10-27 02:00:00+00:00
5          6 2024-10-27 04:00:00+01:00 2024-10-27 03:00:00+00:00
6          7 2024-10-27 05:00:00+01:00 2024-10-27 04:00:00+00:00
7          8 2024-10-27 06:00:00+01:00 2024-10-27 05:00:00+00:00
8          9 2024-10-27 07:00:00+01:00 2024-10-27 06:00:00+00:00
9         10 2024-10-27 08:00:00+01:00 2024-10-27 07:00:00+00:00
10        11 2024-10-27 09:00:00+01:00 2024-10-27 08:00:00+00:00
11        12 2024-10-27 10:00:00+01:00 2024-10-27 09:00:00+00:00
12        13 2024-10-27 11:00:00+01:00 2024-10-27 10:00:00+00:00
13        14 2024-10-27 12:00:00+01:00 2024-10-27 11:00:00+00:00
14        15 2024-10-27 13:00:00+01:00 2024-10-27 12:00:00+00:00
15        16 2024-10-27 14:00:00+01:00 2024-10-27 13:00:00+00:00
16        17 2024-10-27 15:00:00+01:00 2024-10-27 14:00:00+00:00
17        18 2024-10-27 16:00:00+01:00 2024-10-27 15:00:00+00:00
18        19 2024-10-27 17:00:00+01:00 2024-10-27 16:00:00+00:00
19        20 2024-10-27 18:00:00+01:00 2024-10-27 17:00:00+00:00
20        21 2024-10-27 19:00:00+01:00 2024-10-27 18:00:00+00:00
21        22 2024-10-27 20:00:00+01:00 2024-10-27 19:00:00+00:00
22        23 2024-10-27 21:00:00+01:00 2024-10-27 20:00:00+00:00
23        24 2024-10-27 22:00:00+01:00 2024-10-27 21:00:00+00:00
24        25 2024-10-27 23:00:00+01:00 2024-10-27 22:00:00+00:00

Upvotes: 0

Related Questions