klonq

Reputation: 3587

Computer graphing utilities

I have developed a system in R for graphing large datasets obtained from wind turbines. I am now porting the process into Java. The results I get between the two systems are inconsistent.

As shown below:

I can explain the discrepancies between the (red) calculated lines: the two systems use different calculation methods.

In R the data is processed as follows. I wrote this code with a little help and have no idea what is going on here (but hey, it works).

df <- data.frame(pwr = pwr, spd = spd)
require(mgcv)
mod <- gam(pwr ~ s(spd, bs = "ad", k = 20), data = df, method = "REML")
summary(mod)
x_grid <- with(df, data.frame(spd = seq(min(spd) + 0.0001, maxi, length=100)))
pred <- predict(mod, x_grid, se.fit = TRUE)
x_grid <- within(x_grid, fit <- pred$fit)
lines(fit ~ spd, data = x_grid, col = "red", lwd = thickLineWidth)

In Java (SQL, in fact) I am using the method of bins to calculate the average at every 0.5 on the x-axis. The resulting data is plotted using an org.jfree.chart.renderer.xy.XYSplineRenderer; I do not know much about how the line is rendered.

SELECT 
    ROUND( ROUND( x_data * 2 ) / 2, 1)   AS x_axis, # See https://stackoverflow.com/questions/5230647/mysql-rounding-functions
    AVG( y_data )                        AS y_axis 
FROM 
    table 
GROUP BY 
    x_axis
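For the Java side of the port, the same binning can be sketched in plain Java. This is only an illustration of the method of bins described above, not the asker's actual code: the class name BinAverager and the methods binCentre and binAverages are hypothetical, and the rounding is assumed to match the SQL only for non-negative x values (such as wind speeds).

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper: averages y within 0.5-wide bins of x, like the SQL above. */
final class BinAverager {

    /**
     * Snaps x to the nearest multiple of 0.5, i.e. ROUND(x * 2) / 2.
     * Math.round rounds ties upwards, so this is assumed to agree with the
     * SQL rounding only for non-negative x.
     */
    static double binCentre(double x) {
        return Math.round(x * 2.0) / 2.0;
    }

    /** Returns a map from bin centre to the mean of y in that bin, sorted by centre. */
    static Map<Double, Double> binAverages(double[] x, double[] y) {
        if (x.length != y.length) {
            throw new IllegalArgumentException("x and y must have the same length");
        }
        // Accumulate {sum, count} per bin centre.
        Map<Double, double[]> acc = new TreeMap<>();
        for (int i = 0; i < x.length; i++) {
            double[] sumCount = acc.computeIfAbsent(binCentre(x[i]), k -> new double[2]);
            sumCount[0] += y[i];
            sumCount[1] += 1.0;
        }
        // Convert the accumulated sums into means.
        Map<Double, Double> means = new TreeMap<>();
        for (Map.Entry<Double, double[]> e : acc.entrySet()) {
            means.put(e.getKey(), e.getValue()[0] / e.getValue()[1]);
        }
        return means;
    }
}
```

Each entry of the returned map is one (x_axis, y_axis) point that could be handed to the chart. Note that these are plain per-bin means; any smoothing between them comes entirely from the renderer, not from a fitted model.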

My take on the variance between the two graphs:

These are things that I would like to eliminate.

So in an effort to understand the difference between the two graphs I have a few questions:

Upvotes: 2

Views: 423

Answers (1)

Gavin Simpson

Reputation: 174803

In the R code, you are (well, I was when I showed the example) fitting an additive model to the power and speed data, where the relationship between the variables is determined from the data themselves. These models use splines to estimate the response function. In particular, here I used an adaptive smoother, with k = 20 setting the complexity allowed for the fitted smoother. The more complex the smoother, the more wiggly the fitted function can be. An adaptive smoother is one where the degree of smoothness varies across the fitted function.

Why is this important? Well, in your data there are periods where the response does not vary with the speed variable, and periods where the response changes rapidly with a change in speed. We have an "allowance" of wiggliness to use up over the curve. With ordinary splines the wiggliness (or smoothness) is the same across the entire function. With an adaptive smooth we can use more of our wiggliness allowance in the parts of the function where the response is changing most, and not spend any of the allowance in the parts where the response isn't changing.
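The "allowance" idea can be written as a penalised least-squares criterion. This is a conceptual sketch, not the exact objective that mgcv optimises: the symbols f and λ and the integral form of the penalty are notation assumed here for illustration. With y_i the pwr values and x_i the spd values, an ordinary spline smoother chooses f to minimise

$$\sum_{i=1}^{n} \bigl(y_i - f(x_i)\bigr)^2 + \lambda \int f''(x)^2 \, dx$$

with a single smoothing parameter λ for the whole curve, whereas an adaptive smoother lets the penalty weight change with x:

$$\sum_{i=1}^{n} \bigl(y_i - f(x_i)\bigr)^2 + \int \lambda(x) \, f''(x)^2 \, dx$$

A small λ(x) where the response changes rapidly lets the curve bend there, while a large λ(x) on the flat stretches keeps the fit smooth.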

Below I annotate the code to explain what is being done at each step:

## here we create a data frame with the pwr and spd variables
df <- data.frame(pwr = pwr, spd = spd)

## we load the package containing the code to fit the additive model
require(mgcv)

## This is the model itself, saying pwr is modelled as a smooth function of spd
## and the smooth function of spd is generated using an adaptive smoother with
## an "allowance" of 20. This allowance is a starting point and the actual
## smoothness of the curve will be estimated as part of the model fitting,
## here using a REML criterion
mod <- gam(pwr ~ s(spd, bs = "ad", k = 20), data = df, method = "REML")

## This just summarises the model fit
summary(mod)

## In this line we are creating a new spd vector (in a data frame) that contains
## 100 equally spaced spd values, from just above the minimum observed spd up
## to maxi (an upper limit that must already be defined elsewhere)
x_grid <- with(df, data.frame(spd = seq(min(spd) + 0.0001, maxi, length=100)))

## we will use those data to get predictions of the response pwr at each
## of the 100 values of spd we just created
## I did this so we had enough data to plot a nice smooth curve, but without
## having to predict for all the observed values of spd
pred <- predict(mod, x_grid, se.fit = TRUE)

## This line stores the 100 predicted values in the prediction data object
x_grid <- within(x_grid, fit <- pred$fit)

## This line draws the fitted smooth on to a plot of the data
## this assumes there is already a plot on the active device.
lines(fit ~ spd, data = x_grid, col = "red", lwd = thickLineWidth)

If you are not familiar with additive models and smoothers/splines then I recommend Ruppert, Wand and Carroll (2003) Semiparametric Regression. Cambridge University Press.

Upvotes: 4
