Reputation: 321
I have a large two-column dataset representing scattered functional behaviour: for each time value (x) there are a number of widely spread measurement values (y). For each time value (or, thinking in terms of histograms, for each time interval) I want the average of the measurement values y that fall in it. I have looked into rolling/moving averages and spline interpolation, but I'm stuck. Below is a minimal code example of what should happen:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from scipy.interpolate import UnivariateSpline
#generate testdata which usually is read in from a huge file
def testdata(x):
    return 1/(1+10.*x**2)
x = np.random.uniform(-1,1,1000)
y = testdata(x) + np.random.normal(0, 1, len(x))
#convert it to dataframes, as I usually work with them
df = pd.DataFrame(list(zip(x,y)))
#sort the x-values as they are randomly distributed in the dataset
df_new = df.sort_values(by=[0])
#show the data and what the analytical average should look like
plt.scatter(df_new[0],df_new[1],s=1)
plt.scatter(df_new[0],testdata(df_new[0]), s=1, c='r')
#try a spline - however it fails
spl = UnivariateSpline(df_new.iloc[:, 0], df_new.iloc[:, 1])
xs = np.linspace(-1, 1, 10000)
plt.plot(xs, spl(xs), 'g--', lw=3)
plt.show()
Blue is my data, red is what the averaged values should look like (in this test case I obviously know it), and green is what the spline method gives me.
Does anyone know a better (ideally built-in) method to obtain the red curve?
Upvotes: 1
Views: 1019
Reputation: 338
You could round the x values, then use groupby to get the average.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from scipy.interpolate import UnivariateSpline
# generate testdata
def testdata(x):
    return 1/(1+10.*x**2)
# create x, y
x = np.random.uniform(-1,1,1000)
y = testdata(x) + np.random.normal(0, 1, len(x))
# convert to df and sort values inplace
df = pd.DataFrame({'x': x, 'y': y})
df.sort_values(by='x', inplace=True)
# round x values then group by to create bins
round_by = 2
bins = df.groupby(df.x.round(round_by)).mean()
# plot
fig, ax = plt.subplots()
ax.scatter(df.x, df.y, s=1)
ax.plot(bins.index, bins.y, 'g--', lw=3)
plt.show()
Upvotes: 1