Nikolaij

Reputation: 321

Finding the mean value (or rolling average) of a scattered dataset with Python

I have a large two-column dataset representing scattered functional behaviour: for each time value (x) there are many widely spread measurement values (y). For each time value (or, thinking in terms of histograms, within each time interval) I want the average of the measurement values y. I have looked at rolling/moving averages and spline interpolation, but I'm stuck. Below is a minimal example of what should happen:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from scipy.interpolate import UnivariateSpline

#generate testdata which usually is read in from a huge file
def testdata(x):
    return 1/(1+10.*x**2)

x = np.random.uniform(-1,1,1000)
y = testdata(x) + np.random.normal(0, 1, len(x))

#convert it to dataframes, as I usually work with them
df = pd.DataFrame(list(zip(x,y)))

#sort the x-values as they are randomly distributed in the dataset
df_new = df.sort_values(by=[0])

#show the data and what the analytical average should look like
plt.scatter(df_new[0],df_new[1],s=1)
plt.scatter(df_new[0],testdata(df_new[0]), s=1, c='r')

#try a spline - however it fails
spl = UnivariateSpline(df_new.iloc[:, 0], df_new.iloc[:, 1])
xs = np.linspace(-1, 1, 10000)
plt.plot(xs, spl(xs), 'g--', lw=3)

plt.show()

output of script above

So blue is my data, red is what the averaged values should look like (in this test case I obviously know it), and green is what the spline method gives me.

Surely someone knows a better (ideally built-in) method to obtain the red curve?

Upvotes: 1

Views: 1019

Answers (1)

Ali

Reputation: 338

You could round the x values and then group by the rounded values to get the average of y in each bin.

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# generate testdata
def testdata(x):
    return 1/(1+10.*x**2)

# create x, y
x = np.random.uniform(-1,1,1000)
y = testdata(x) + np.random.normal(0, 1, len(x))

# convert to df and sort values inplace
df = pd.DataFrame({'x': x, 'y': y})
df.sort_values(by='x', inplace=True)

# round x values then group by to create bins
round_by = 2
bins = df.groupby(df.x.round(round_by)).mean()

# plot
fig, ax = plt.subplots()

ax.scatter(df.x, df.y, s=1)
ax.plot(bins.index, bins.y, 'g--', lw=3)

plt.show()

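Rounding x is a quick way of binning. As a rough sketch of two closely related alternatives, the block below averages y inside explicit equal-width x intervals with `pd.cut`, and also computes a centered moving average over neighbouring points with `Series.rolling`. The helper names `binned_mean` and `rolling_mean` and the parameters `n_bins` and `window` are made up for this illustration, and the values chosen for them are arbitrary; they would need tuning for real data.

```python
import numpy as np
import pandas as pd


def binned_mean(df, n_bins=50):
    """Average y inside equal-width x intervals (a histogram-style mean)."""
    edges = np.linspace(df["x"].min(), df["x"].max(), n_bins + 1)
    # include_lowest=True keeps the smallest x value inside the first bin
    labels = pd.cut(df["x"], bins=edges, include_lowest=True)
    # observed=True drops bins that contain no points
    grouped = df.groupby(labels, observed=True)["y"].mean()
    centers = np.array([interval.mid for interval in grouped.index])
    return pd.DataFrame({"x": centers, "y": grouped.to_numpy()})


def rolling_mean(df, window=100):
    """Centered moving average of y over `window` neighbouring points."""
    ordered = df.sort_values("x").reset_index(drop=True)
    # min_periods=1 avoids NaN at the edges, at the cost of noisier ends
    smooth = ordered["y"].rolling(window, center=True, min_periods=1).mean()
    return pd.DataFrame({"x": ordered["x"], "y": smooth})


# same kind of test data as in the question (seeded for reproducibility)
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 1000)
y = 1 / (1 + 10.0 * x**2) + rng.normal(0, 1, len(x))
df = pd.DataFrame({"x": x, "y": y})

bins = binned_mean(df, n_bins=20)
smooth = rolling_mean(df, window=100)
```

Either result can be drawn over the scatter plot with `ax.plot(bins.x, bins.y)` or `ax.plot(smooth.x, smooth.y)`. The binned version uses a fixed interval width in x, while the rolling version uses a fixed number of points, so they behave differently where the data is sparse. Wider bins or larger windows give a smoother curve but flatten the peak around x = 0.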

Upvotes: 1
