Reputation: 440
I have two dataframes where both are day-count data with respect to some start time. One dataframe is a signal data sampled at a higher rate compared to the other dataframe signal data. The overall time frame is same for both the signal dataframes. My question is what approach can I use to downsample the one with more rows to match the number of rows. the data frame looks like these: (where the first column is day count from a particular start time)
Days Data at 1hz
0 0.958 83.63
1 0.958 83.08
2 0.958 82.45
3 0.958 81.83
4 0.959 81.18
Days Data at 4hz
0 0.958 0.028
1 0.958 0.028
2 0.958 0.027
3 0.958 0.029
4 0.958 0.028
I have tried the pandas.resample()
function but apparently it works for only timestamp index. And also rolling()
function makes the other values over a window = 8 (diff in sampling ratio of data1 and data2) as Nan
and does not remove the rows.
Is there a way, I can apply mean()
or some other approach to make them both of same rows by removing some rows ?
Upvotes: 3
Views: 1885
Reputation: 1205
an alternative option is to use groupby:
import pandas as pd
df = pd.DataFrame({'days': [0.958] * 100, 'data 4Hz': [*range(100)]})
df
to go from 4 hz -> 1 hz we will downsample using a window of 4, and we use a mean here to downsample, but you can use anything that works with groupby.
window = 4
df_1hz = df.groupby((df.index/window).astype('i')).mean()
df_1hz.head()
Upvotes: 0
Reputation: 29635
Using rolling
is a good idea what you are missing is a dropna
after to get rid of the extra rows. Here an example, with 3 rows per Date in df1
and 8 rows per Date in df2
:
df1 = pd.DataFrame({'Date':[0.958]*3 + [0.959]*3,
'Data_1': [83.63,83.08,82.45,81.83,82.76,84.97]})
df2 = pd.DataFrame({'Date':[0.958]*8 + [0.959]*8,
'Data_4': [0.028,0.028,0.027,0.029]*4})
Now, you want to remove 5 rows in df2
per Date to get only 3 rows as in df1
, you get it with groupby
, rolling
and dropna
. Function mean
can be changed if you want something else, and reset_index
is more for cosmetic.
df3 = df2.groupby('Date').rolling(window=6).mean().dropna().reset_index(drop=True)
which gives the result for df3
Data_4 Date
0 0.028000 0.958
1 0.027833 0.958
2 0.028000 0.958
3 0.028000 0.959
4 0.027833 0.959
5 0.028000 0.959
with only 3 rows per Date from data in df2
.
Note that in the rolling
, window = 6
, which is the difference of rows per Date between df2 and df1 plus 1, otherwise you get one extra row.
Upvotes: 1