Reputation: 185
I have weather data with the following columns, where the first 3 rows look like this:
date | hour | city | condition | snow | rain |
---|---|---|---|---|---|
2023-01-30 | 3 | berlin | snow | 1 | 0 |
2023-01-30 | 6 | berlin | rain | 0 | 1 |
2023-01-30 | 9 | berlin | clear | 0 | 0 |
I want to write code that will create rows for the missing hours and fill them with the values from the existing row (same date and city) whose hour is closest. The resulting DataFrame should look like this:
date | hour | city | condition | snow | rain |
---|---|---|---|---|---|
2023-01-30 | 3 | berlin | snow | 1 | 0 |
2023-01-30 | 4 | berlin | snow | 1 | 0 |
2023-01-30 | 5 | berlin | snow | 1 | 0 |
2023-01-30 | 6 | berlin | rain | 0 | 1 |
2023-01-30 | 7 | berlin | rain | 0 | 1 |
2023-01-30 | 8 | berlin | rain | 0 | 1 |
2023-01-30 | 9 | berlin | clear | 0 | 0 |
2023-01-30 | 10 | berlin | clear | 0 | 0 |
2023-01-30 | 11 | berlin | clear | 0 | 0 |
Note: I have many cities and many rows.
I tried the following, but it didn't give the right result and it isn't efficient for a large number of rows (many cities and hours):
df_expanded = (df.set_index(['date', 'city', 'condition'])
                 .hour.unstack().reset_index()
                 .melt(id_vars=['date', 'city', 'condition'], value_name='hour')
                 .dropna()
                 .drop(columns=['variable']))
df_expanded = (df_expanded.sort_values(by=['date', 'city', 'condition', 'hour'])
                          .ffill())
result = (df_expanded.merge(df, on=['date', 'city', 'condition', 'hour'], how='left')
                     .dropna()
                     .drop_duplicates())
I'm open to easier and simpler solutions.
Upvotes: 1
Views: 66
Reputation: 14103
It is easiest to ffill the missing data as shown below; a sketch of a closest-time variant is at the end of this answer.
import pandas as pd

# some sample data
d = {'date': ['2023-01-30', '2023-01-30', '2023-01-30', '2023-01-30', '2023-01-30', '2023-01-30'],
     'hour': [3, 6, 9, 3, 6, 9],
     'city': ['berlin', 'berlin', 'berlin', 'chicago', 'chicago', 'chicago'],
     'condition': ['snow', 'rain', 'clear', 'snow', 'snow', 'clear'],
     'snow': [1, 0, 0, 1, 1, 0],
     'rain': [0, 1, 0, 0, 0, 0]}
df = pd.DataFrame(d)

# convert the date to datetime, add the hour as a timedelta, and set the combined timestamp as the index
df = df.set_index(pd.to_datetime(df['date']) + pd.to_timedelta(df['hour'], unit='h')).drop(columns=['date', 'hour'])

# group by city, resample to hourly frequency, and ffill the missing rows
df.groupby('city').resample('h').ffill().reset_index(level=0, drop=True)
city condition snow rain
2023-01-30 03:00:00 berlin snow 1 0
2023-01-30 04:00:00 berlin snow 1 0
2023-01-30 05:00:00 berlin snow 1 0
2023-01-30 06:00:00 berlin rain 0 1
2023-01-30 07:00:00 berlin rain 0 1
2023-01-30 08:00:00 berlin rain 0 1
2023-01-30 09:00:00 berlin clear 0 0
2023-01-30 03:00:00 chicago snow 1 0
2023-01-30 04:00:00 chicago snow 1 0
2023-01-30 05:00:00 chicago snow 1 0
2023-01-30 06:00:00 chicago snow 1 0
2023-01-30 07:00:00 chicago snow 1 0
2023-01-30 08:00:00 chicago snow 1 0
2023-01-30 09:00:00 chicago clear 0 0
If you want the original date and hour columns back, add the following:
new_df = df.groupby('city').resample('h').ffill().reset_index(level=0, drop=True)
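# move the datetime index back into a column, then split it into separate date and hour columns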
new_df = new_df.reset_index().rename(columns={'index': 'date'})
new_df['hour'] = new_df['date'].dt.hour
new_df['date'] = new_df['date'].dt.date
date city condition snow rain hour
0 2023-01-30 berlin snow 1 0 3
1 2023-01-30 berlin snow 1 0 4
2 2023-01-30 berlin snow 1 0 5
3 2023-01-30 berlin rain 0 1 6
4 2023-01-30 berlin rain 0 1 7
5 2023-01-30 berlin rain 0 1 8
6 2023-01-30 berlin clear 0 0 9
7 2023-01-30 chicago snow 1 0 3
8 2023-01-30 chicago snow 1 0 4
9 2023-01-30 chicago snow 1 0 5
10 2023-01-30 chicago snow 1 0 6
11 2023-01-30 chicago snow 1 0 7
12 2023-01-30 chicago snow 1 0 8
13 2023-01-30 chicago clear 0 0 9
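For the "closest hour" behaviour mentioned at the start of this answer, a minimal sketch (assuming the same datetime-indexed df as above, and that Resampler.nearest works on a grouped resample the same way ffill does) simply swaps ffill for nearest, so every new hour copies the observation nearest in time. Note this differs slightly from forward filling: hour 4 would come from hour 3, but hour 5 would come from hour 6.

# sketch: fill each missing hour per city from the nearest original observation
nearest_df = df.groupby('city').resample('h').nearest().reset_index(level=0, drop=True)
# restore the date and hour columns as before
nearest_df = nearest_df.reset_index().rename(columns={'index': 'date'})
nearest_df['hour'] = nearest_df['date'].dt.hour
nearest_df['date'] = nearest_df['date'].dt.date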
Upvotes: 2