Reputation: 25
I want to apply a linear regression to rows in my data frame based on an ID number. I want to do this using the ch4_ppm column. For example, for all the rows that have 'measurement_id'==1, do a regression on this set and store the results in a data frame. I have a total of 157 sets I want to repeat this for. Below is a sample from my dataset.
structure(list(time = c("2022-07-26 09:11:40", "2022-07-26 09:11:45", "2022-07-26 09:11:50", "2022-07-26 09:11:55", "2022-07-26 09:12:00", "2022-07-26 09:12:05", "2022-07-26 09:12:10", "2022-07-26 09:12:15", "2022-07-26 09:12:20", "2022-07-26 09:12:25", "2022-07-26 09:12:30", "2022-07-26 09:12:35", "2022-07-26 09:12:40", "2022-07-26 09:12:45", "2022-07-26 09:12:50", "2022-07-26 09:12:55", "2022-07-26 09:13:00", "2022-07-26 09:13:05", "2022-07-26 09:13:10", "2022-07-26 09:13:15", "2022-07-26 09:13:20", "2022-07-26 09:13:25", "2022-07-26 09:13:30", "2022-07-26 09:13:35", "2022-07-26 09:13:40", "2022-07-26 09:13:45", "2022-07-26 09:13:50", "2022-07-26 09:13:55", "2022-07-26 09:14:00", "2022-07-26 09:14:05", "2022-07-26 09:14:10", "2022-07-26 09:14:15", "2022-07-26 09:14:20", "2022-07-26 09:14:25", "2022-07-26 09:14:30", "2022-07-26 09:14:35", "2022-07-26 09:14:40", "2022-07-26 09:20:55", "2022-07-26 09:21:00", "2022-07-26 09:21:05", "2022-07-26 09:21:10", "2022-07-26 09:21:15", "2022-07-26 09:21:20", "2022-07-26 09:21:25", "2022-07-26 09:21:30", "2022-07-26 09:21:35", "2022-07-26 09:21:40", "2022-07-26 09:21:45", "2022-07-26 09:21:50", "2022-07-26 09:21:55", "2022-07-26 09:22:00", "2022-07-26 09:22:05", "2022-07-26 09:22:10", "2022-07-26 09:22:15", "2022-07-26 09:22:20", "2022-07-26 09:22:25", "2022-07-26 09:22:30", "2022-07-26 09:22:35", "2022-07-26 09:22:40", "2022-07-26 09:22:45", "2022-07-26 09:22:50", "2022-07-26 09:22:55", "2022-07-26 09:23:00", "2022-07-26 09:23:05", "2022-07-26 09:23:10", "2022-07-26 09:23:15", "2022-07-26 09:23:20", "2022-07-26 09:23:25", "2022-07-26 09:23:30", "2022-07-26 09:28:10", "2022-07-26 09:28:15", "2022-07-26 09:28:20", "2022-07-26 09:28:25", "2022-07-26 09:28:30", "2022-07-26 09:28:35", "2022-07-26 09:28:40", "2022-07-26 09:28:45", "2022-07-26 09:28:50", "2022-07-26 09:28:55", "2022-07-26 09:29:00", "2022-07-26 09:29:05", "2022-07-26 09:29:10", "2022-07-26 09:29:15", "2022-07-26 09:29:20", "2022-07-26 09:29:25", "2022-07-26 09:29:30", "2022-07-26 09:29:35", "2022-07-26 09:29:40", "2022-07-26 09:29:45", "2022-07-26 09:29:50", "2022-07-26 09:29:55", "2022-07-26 09:30:00", "2022-07-26 09:30:05", "2022-07-26 09:30:10", "2022-07-26 09:30:15", "2022-07-26 09:30:20", "2022-07-26 09:30:25", "2022-07-26 09:30:30", "2022-07-26 09:30:35", "2022-07-26 09:30:40"), ch4_ppm = c(27.75,
36.32, 51.63, 54.56, 92, 121.5, 127.6, 193.7, 189.3, 279.2, 323.2,
368.6, 426.1, 485.7, 545.7, 492.8, 548.1, 751.6, 820, 892, 959, 1018, 1080, 1152, 1214, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
3.346, 3.672, 3.958, 8.33, 21.09, 47.68, 53.71, 150.8,
190.6, 222.4, 267.3, 326.4, 314.5, 375.8, 435.3, 595.7, 534.3,
740.6, 819, 873, 923, 987, 1067, 1140, 1206, NA, NA, NA, NA, NA, NA, NA, 14.71, 14.7, 29.95, 48.91, 75.21, 102, 129.9, 131.9,
199.3, 140, 272.2, 313.3, 354.1, 398.4, 449.3, 503.6, 557.9,
614.2, 676, 734.4, 628.6, 672.4, 536.2, 767, 1002, 1054, 1099, 1143, 1194, NA, NA), co2_ppm = c(491, 500.3, 509.1, 517.7, 526.6,
530.3, 534.7, 536.1, 539, 542.5, 549.8, 557.7, 570.5, 584.4,
602.6, 622.9, 648.3, 674.4, 705.5, 736.2, 772.3, 807, 846, 883, 927, 967, 1010, 1049, 1094, 1135, 1181, 1226, 1268, 1313, 1359, 1403, 1445, 426.7, 427.2, 427, 428.5, 428.5, 429.8, 430.5, 434.4,
438.4, 444.4, 452.9, 464.3, 477.2, 493.1, 510.4, 532.7, 556,
584.6, 613.6, 648.9, 681.6, 719, 753.5, 794.5, 831, 874, 917, 958, 1001, 1045, 1090, 1131, 439.7, 436.8, 436, 435.3, 436.9,
438.7, 443.4, 450.5, 457.3, 467.6, 480.3, 495.3, 512.9, 532.9,
554.9, 578.3, 605.7, 633.1, 664.3, 692.8, 727.2, 758.5, 794.7, 828, 868, 903, 945, 983, 1023, 1062, 1104), chamber_temp_degC = c(20.85,
20.86, 20.87, 20.88, 20.89, 20.9, 20.91, 20.91, 20.93, 20.94,
20.96, 20.96, 20.97, 20.98, 20.99, 20.99, 21, 21.01, 21.02, 21.03,
21.03, 21.05, 21.06, 21.06, 21.07, 21.08, 21.09, 21.09, 21.09,
21.1, 21.11, 21.11, 21.12, 21.13, 21.13, 21.14, 21.15, 21.6,
21.6, 21.61, 21.61, 21.62, 21.63, 21.63, 21.64, 21.65, 21.67,
21.67, 21.67, 21.67, 21.68, 21.68, 21.69, 21.69, 21.7, 21.71,
21.71, 21.72, 21.72, 21.73, 21.73, 21.74, 21.74, 21.75, 21.75,
21.76, 21.76, 21.76, 21.77, 22.1, 22.12, 22.13, 22.14, 22.15,
22.16, 22.17, 22.18, 22.19, 22.19, 22.2, 22.2, 22.21, 22.21,
22.22, 22.22, 22.23, 22.23, 22.24, 22.24, 22.25, 22.26, 22.26,
22.26, 22.27, 22.27, 22.27, 22.27, 22.27, 22.27, 22.27), measurement_id = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3)), row.names = c(21L, 22L, 23L, 24L, 25L, 26L, 27L, 28L, 29L, 30L, 31L, 32L, 33L, 34L, 35L, 36L, 37L, 38L, 39L, 40L, 41L, 42L, 43L, 44L, 45L, 46L, 47L, 48L, 49L, 50L, 51L, 52L, 53L, 54L, 55L, 56L, 57L, 132L, 133L, 134L, 135L, 136L, 137L, 138L, 139L, 140L, 141L, 142L, 143L, 144L, 145L, 146L, 147L, 148L, 149L, 150L, 151L, 152L, 153L, 154L, 155L, 156L, 157L, 158L, 159L, 160L, 161L, 162L, 163L, 219L, 220L, 221L, 222L, 223L, 224L, 225L, 226L, 227L, 228L, 229L, 230L, 231L, 232L, 233L, 234L, 235L, 236L, 237L, 238L, 239L, 240L, 241L, 242L, 243L, 244L, 245L, 246L, 247L, 248L, 249L), class = "data.frame")
I want the output to be a data frame of the calculated slope next to the start time of each sequence (the timestamp in the first row of each ID) next to the calculated slope from the regression like below
time slope measurement_id
2022-09-18 8.67 1
2022-09-18 5.34 2
... ...
Any help would be greatly appreciated :)
Upvotes: 1
Views: 239
Reputation: 146
Here's a solution avoiding tidyverse and dealing with the formatting of the time, assuming you are interested in time as a continuous variable measured in seconds. Although, it isn't clear from the question what units you want to measure time in. Saving your data as a data frame called df
.
df$time <- lubridate::ymd_hms(df$time)
# function to subset data, calculate time difference in seconds,
# and get linear regression slope
getSlope <- function(id){
data <- df[df$measurement_id==id,]
data$t <- data$time - min(data$time)
fit <- lm(ch4_ppm ~ t, data=data)
return(fit$coefficients['t'])
}
# new data frame to hold results
results <- data.frame(id = 1:3, slope=NA)
results$slope <- sapply(1:3,getSlope)
results
#> id slope
#>1 1 10.365175
#>2 2 10.735887
#>3 3 8.465798
Upvotes: 2
Reputation: 41469
You could group_by
"measurement_id" and add a column with is the row id per group using row_number
and summarise
the slope which is in the coefficients
of lm
like this:
library(dplyr)
df %>%
group_by(measurement_id) %>%
mutate(rowid = row_number()) %>%
summarise(time = first(time),
slope = lm(ch4_ppm ~ rowid)$coefficients['rowid'])
#> # A tibble: 3 × 3
#> measurement_id time slope
#> <dbl> <chr> <dbl>
#> 1 1 2022-07-26 09:11:40 51.8
#> 2 2 2022-07-26 09:20:55 53.7
#> 3 3 2022-07-26 09:28:10 42.3
Created on 2022-10-12 with reprex v2.0.2
Upvotes: 2