How to smooth data of increasing noise

Question

Chemist here (so not very good with statistical analysis) and a novice in R:

I have various sets of data where the yield of a reaction is monitored over time, such as:

The data:

df <- structure(list(time = c(15, 30, 45, 60, 75, 90, 105, 120, 135, 150, 165, 180, 195, 210, 225, 240, 255, 270, 285, 300, 315, 330, 345, 360, 375, 390, 405, 420, 435, 450, 465, 480, 495, 510, 525, 540, 555, 570, 585, 600, 615, 630, 645, 660, 675, 690, 705, 720, 735, 750, 765, 780, 795, 810, 825, 840, 855, 870, 885, 900, 915, 930, 945, 960, 975, 990, 1005, 1020, 1035, 1050, 1065, 1080, 1095, 1110, 1125, 1140, 1155, 1170, 1185, 1200, 1215, 1230, 1245, 1260, 1275, 1290, 1305, 1320, 1335, 1350, 1365, 1380, 1395, 1410, 1425, 1440, 1455, 1470, 1485, 1500, 1515, 1530, 1545, 1560, 1575, 1590, 1605, 1620, 1635, 1650, 1665, 1680, 1695, 1710, 1725, 1740, 1755, 1770, 1785, 1800, 1815, 1830, 1845, 1860, 1875, 1890, 1905, 1920, 1935, 1950, 1965, 1980, 1995, 2010, 2025, 2040, 2055, 2070, 2085, 2100, 2115, 2130), yield = c(9.3411, 9.32582, 10.5475, 13.5358, 17.3376, 16.7444, 20.7234, 19.8374, 24.327, 27.4162, 27.38, 31.3926, 29.3289, 32.2556, 33.0025, 35.3358, 35.8986, 40.1859, 40.3886, 42.2828, 41.23, 43.8108, 43.9391, 43.9543, 48.0524, 47.8295, 48.674, 48.2456, 50.2641, 50.7147, 49.6828, 52.8877, 51.7906, 57.2553, 53.6175, 57.0186, 57.6598, 56.4049, 57.1446, 58.5464, 60.7213, 61.0584, 57.7481, 59.9151, 64.475, 61.2322, 63.5167, 64.6289, 64.4245, 62.0048, 65.5821, 65.8275, 65.7584, 68.0523, 65.4874, 68.401, 68.1503, 67.8713, 69.5478, 69.9774, 73.4199, 66.7266, 70.4732, 67.5119, 69.6107, 70.4911, 72.7592, 69.3821, 72.049, 70.2548, 71.6336, 70.6215, 70.8611, 72.0337, 72.2842, 76.0792, 75.2526, 72.7016, 73.6547, 75.6202, 76.5013, 74.2459, 76.033, 78.4803, 76.3058, 73.837, 74.795, 76.2126, 75.1816, 75.3594, 79.9158, 77.8157, 77.8152, 75.3712, 78.3249, 79.1198, 77.6184, 78.1244, 78.1741, 77.9305, 79.7576, 78.0261, 79.8136, 75.5314, 80.2177, 79.786, 81.078, 78.4183, 80.8013, 79.3855, 81.5268, 78.416, 78.9021, 79.9394, 80.8221, 81.241, 80.6111, 79.7504, 81.6001, 80.7021, 81.1008, 82.843, 82.2716, 83.024, 81.0381, 80.0248, 85.1418, 83.1229, 83.3334, 83.2149, 84.836, 79.5156, 81.909, 
81.1477, 85.1715, 83.7502, 83.8336, 83.7595, 86.0062, 84.9572, 86.6709, 84.4124)), .Names = c("time", "yield"), row.names = c(NA, -142L), class = "data.frame")
What I want to do to the data:

I need to smooth the data in order to plot the first derivative. In the paper, the author mentions that one can fit a high-order polynomial and use that for the processing. I think this is wrong, since we don't really know the true relationship between time and yield, and it is definitely not polynomial. I tried it regardless and, as expected, the plot of the derivative did not make any chemical sense. Next I looked into loess using loes <- loess(yield ~ time, data = df, span = 0.9), which gave a much better fit. However, the best result so far came from:

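# Column names in df are lowercase: time, yield
spl <- smooth.spline(df$time, y = df$yield, cv = TRUE)
# Fitted values at the observed times
predspl <- as.data.frame(predict(spl))
colnames(predspl) <- c('Time', 'Yield')
# First derivative of the fitted spline at the observed times
pred.der <- as.data.frame(predict(spl, deriv = 1))
colnames(pred.der) <- c('Time', 'Yield')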

which gave the best fit, especially for the initial data points (by visual inspection).
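For concreteness, here is a minimal sketch of how the loess and smooth.spline derivatives can be compared side by side. It runs on synthetic stand-in data: the `sim` data frame and its noise model (noise growing with time) are assumptions made for illustration, not the measured yields. Since `loess` has no derivative option, its slope is approximated here by finite differences on a fine grid.

```r
# Synthetic stand-in with the same columns as df (NOT the measured data)
set.seed(1)
time <- seq(15, 2130, by = 15)
sim <- data.frame(
  time  = time,
  yield = 85 * (1 - exp(-time / 600)) +
          rnorm(length(time), sd = 0.2 + time / 1500)  # noise grows with time
)

# loess fit with the span used in the question
lo <- loess(yield ~ time, data = sim, span = 0.9)

# Finite-difference slope of the loess curve, evaluated at grid midpoints
grid   <- seq(min(sim$time), max(sim$time), length.out = 400)
fit_lo <- predict(lo, newdata = data.frame(time = grid))
der_lo <- data.frame(time   = grid[-1] - diff(grid) / 2,
                     dyield = diff(fit_lo) / diff(grid))

# Analytic first derivative of the smoothing spline at the same points
spl     <- smooth.spline(sim$time, sim$yield, cv = TRUE)
der_spl <- as.data.frame(predict(spl, x = der_lo$time, deriv = 1))

# Overlay both derivative estimates
plot(der_spl$x, der_spl$y, type = "l", col = "red",
     xlab = "time", ylab = "d(yield)/d(time)")
lines(der_lo$time, der_lo$dyield, col = "blue")
```

Replacing `sim` with `df` should give the corresponding plot for the real data.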

The problem I have:

The issue, however, is that the derivative looks really good only up to about t = 500 s, and then it starts wiggling more and more towards the end. This shouldn't happen from a chemistry point of view; it is just a result of overfitting towards the end of the data, where the noise increases. I know this because, for some experiments that I performed three times and averaged (so the noise decreased), the wiggling in the plot of the derivative is much smaller.

What i have tried so far:

I tried different values of spar. Larger values smooth the later data correctly, but they cause a poor fit to the initial data (which are the most important). I also tried reducing the number of knots, but the result was similar to the one from changing spar. What I think I need is a large number of knots at the beginning that decreases smoothly to a small number of knots towards the end, to avoid the overfitting.
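To make the idea concrete, here is a rough sketch of both the global settings that were tried and the uneven knot placement described above. It is only an illustration under assumptions: the data are a synthetic stand-in (`sim`), the values of `spar` and `nknots` are arbitrary, and the uneven knots are implemented with an unpenalized regression spline (`lm` with `splines::ns`) rather than with `smooth.spline`, which is a swapped-in technique and not something taken from the paper.

```r
library(splines)  # ships with base R

# Synthetic stand-in with the same columns as df (NOT the measured data)
set.seed(1)
time <- seq(15, 2130, by = 15)
sim <- data.frame(
  time  = time,
  yield = 85 * (1 - exp(-time / 600)) +
          rnorm(length(time), sd = 0.2 + time / 1500)
)

# What was tried: one global smoothing level for the whole curve
spl_spar  <- smooth.spline(sim$time, sim$yield, spar = 0.9)   # arbitrary spar
spl_knots <- smooth.spline(sim$time, sim$yield, nknots = 8)   # arbitrary nknots

# The idea: interior knots dense at early times, sparse at late times.
# Squaring an even sequence on (0, 1) bunches the knots near the start.
u     <- seq(0, 1, length.out = 8)[-c(1, 8)]
knots <- min(sim$time) + diff(range(sim$time)) * u^2

# Natural cubic regression spline on the hand-placed knots
fit <- lm(yield ~ ns(time, knots = knots), data = sim)

# Finite-difference derivative of the fitted curve on a fine grid
grid     <- seq(min(sim$time), max(sim$time), length.out = 400)
fit_grid <- predict(fit, newdata = data.frame(time = grid))
der <- data.frame(time   = grid[-1] - diff(grid) / 2,
                  dyield = diff(fit_grid) / diff(grid))
```

The number of knots and the squaring rule are free choices here; whether this kind of placement actually removes the late-time wiggles on the real data is exactly what I am unsure about.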

The question:

Is my reasoning correct here? Does anyone know how I can achieve the above effect in order to get a smooth derivative without any wiggling? Do I need to try a different fit instead of the spline? I have attached a picture at the end where you can see the derivative from smooth.spline vs. time, together with a black line (drawn by hand) showing what it should look like. Thank you in advance for your help.

