Perform linear regression on rows of data based on ID in R

Question

I want to apply a linear regression to rows in my data frame based on an ID number. I want to do this using the ch4_ppm column. For example, for all the rows that have 'measurement_id'==1, do a regression on this set and store the results in a data frame. I have a total of 157 sets I want to repeat this for. Below is a sample from my dataset.

structure(list(time = c("2022-07-26 09:11:40", "2022-07-26 09:11:45",  "2022-07-26 09:11:50", "2022-07-26 09:11:55", "2022-07-26 09:12:00",  "2022-07-26 09:12:05", "2022-07-26 09:12:10", "2022-07-26 09:12:15",  "2022-07-26 09:12:20", "2022-07-26 09:12:25", "2022-07-26 09:12:30",  "2022-07-26 09:12:35", "2022-07-26 09:12:40", "2022-07-26 09:12:45",  "2022-07-26 09:12:50", "2022-07-26 09:12:55", "2022-07-26 09:13:00",  "2022-07-26 09:13:05", "2022-07-26 09:13:10", "2022-07-26 09:13:15",  "2022-07-26 09:13:20", "2022-07-26 09:13:25", "2022-07-26 09:13:30",  "2022-07-26 09:13:35", "2022-07-26 09:13:40", "2022-07-26 09:13:45",  "2022-07-26 09:13:50", "2022-07-26 09:13:55", "2022-07-26 09:14:00",  "2022-07-26 09:14:05", "2022-07-26 09:14:10", "2022-07-26 09:14:15",  "2022-07-26 09:14:20", "2022-07-26 09:14:25", "2022-07-26 09:14:30",  "2022-07-26 09:14:35", "2022-07-26 09:14:40", "2022-07-26 09:20:55",  "2022-07-26 09:21:00", "2022-07-26 09:21:05", "2022-07-26 09:21:10",  "2022-07-26 09:21:15", "2022-07-26 09:21:20", "2022-07-26 09:21:25",  "2022-07-26 09:21:30", "2022-07-26 09:21:35", "2022-07-26 09:21:40",  "2022-07-26 09:21:45", "2022-07-26 09:21:50", "2022-07-26 09:21:55",  "2022-07-26 09:22:00", "2022-07-26 09:22:05", "2022-07-26 09:22:10",  "2022-07-26 09:22:15", "2022-07-26 09:22:20", "2022-07-26 09:22:25",  "2022-07-26 09:22:30", "2022-07-26 09:22:35", "2022-07-26 09:22:40",  "2022-07-26 09:22:45", "2022-07-26 09:22:50", "2022-07-26 09:22:55",  "2022-07-26 09:23:00", "2022-07-26 09:23:05", "2022-07-26 09:23:10",  "2022-07-26 09:23:15", "2022-07-26 09:23:20", "2022-07-26 09:23:25",  "2022-07-26 09:23:30", "2022-07-26 09:28:10", "2022-07-26 09:28:15",  "2022-07-26 09:28:20", "2022-07-26 09:28:25", "2022-07-26 09:28:30",  "2022-07-26 09:28:35", "2022-07-26 09:28:40", "2022-07-26 09:28:45",  "2022-07-26 09:28:50", "2022-07-26 09:28:55", "2022-07-26 09:29:00",  "2022-07-26 09:29:05", "2022-07-26 09:29:10", "2022-07-26 09:29:15",  "2022-07-26 09:29:20", "2022-07-26 09:29:25", "2022-07-26 09:29:30",  "2022-07-26 09:29:35", "2022-07-26 09:29:40", "2022-07-26 09:29:45",  "2022-07-26 09:29:50", "2022-07-26 09:29:55", "2022-07-26 09:30:00",  "2022-07-26 09:30:05", "2022-07-26 09:30:10", "2022-07-26 09:30:15",  "2022-07-26 09:30:20", "2022-07-26 09:30:25", "2022-07-26 09:30:30",  "2022-07-26 09:30:35", "2022-07-26 09:30:40"), ch4_ppm = c(27.75, 
36.32, 51.63, 54.56, 92, 121.5, 127.6, 193.7, 189.3, 279.2, 323.2, 
368.6, 426.1, 485.7, 545.7, 492.8, 548.1, 751.6, 820, 892, 959,  1018, 1080, 1152, 1214, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,  NA, NA,
3.346, 3.672, 3.958, 8.33, 21.09, 47.68, 53.71, 150.8, 
190.6, 222.4, 267.3, 326.4, 314.5, 375.8, 435.3, 595.7, 534.3, 
740.6, 819, 873, 923, 987, 1067, 1140, 1206, NA, NA, NA, NA,  NA, NA, NA, 14.71, 14.7, 29.95, 48.91, 75.21, 102, 129.9, 131.9, 
199.3, 140, 272.2, 313.3, 354.1, 398.4, 449.3, 503.6, 557.9, 
614.2, 676, 734.4, 628.6, 672.4, 536.2, 767, 1002, 1054, 1099,  1143, 1194, NA, NA), co2_ppm = c(491, 500.3, 509.1, 517.7, 526.6, 
530.3, 534.7, 536.1, 539, 542.5, 549.8, 557.7, 570.5, 584.4, 
602.6, 622.9, 648.3, 674.4, 705.5, 736.2, 772.3, 807, 846, 883,  927, 967, 1010, 1049, 1094, 1135, 1181, 1226, 1268, 1313, 1359,  1403, 1445, 426.7, 427.2, 427, 428.5, 428.5, 429.8, 430.5, 434.4, 
438.4, 444.4, 452.9, 464.3, 477.2, 493.1, 510.4, 532.7, 556, 
584.6, 613.6, 648.9, 681.6, 719, 753.5, 794.5, 831, 874, 917,  958, 1001, 1045, 1090, 1131, 439.7, 436.8, 436, 435.3, 436.9, 
438.7, 443.4, 450.5, 457.3, 467.6, 480.3, 495.3, 512.9, 532.9, 
554.9, 578.3, 605.7, 633.1, 664.3, 692.8, 727.2, 758.5, 794.7,  828, 868, 903, 945, 983, 1023, 1062, 1104), chamber_temp_degC = c(20.85, 
20.86, 20.87, 20.88, 20.89, 20.9, 20.91, 20.91, 20.93, 20.94, 
20.96, 20.96, 20.97, 20.98, 20.99, 20.99, 21, 21.01, 21.02, 21.03, 
21.03, 21.05, 21.06, 21.06, 21.07, 21.08, 21.09, 21.09, 21.09, 
21.1, 21.11, 21.11, 21.12, 21.13, 21.13, 21.14, 21.15, 21.6, 
21.6, 21.61, 21.61, 21.62, 21.63, 21.63, 21.64, 21.65, 21.67, 
21.67, 21.67, 21.67, 21.68, 21.68, 21.69, 21.69, 21.7, 21.71, 
21.71, 21.72, 21.72, 21.73, 21.73, 21.74, 21.74, 21.75, 21.75, 
21.76, 21.76, 21.76, 21.77, 22.1, 22.12, 22.13, 22.14, 22.15, 
22.16, 22.17, 22.18, 22.19, 22.19, 22.2, 22.2, 22.21, 22.21, 
22.22, 22.22, 22.23, 22.23, 22.24, 22.24, 22.25, 22.26, 22.26, 
22.26, 22.27, 22.27, 22.27, 22.27, 22.27, 22.27, 22.27), measurement_id = c(1,  1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,  1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2,  2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,  2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,  3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3)), row.names = c(21L,  22L, 23L, 24L, 25L, 26L, 27L, 28L, 29L, 30L, 31L, 32L, 33L, 34L,  35L, 36L, 37L, 38L, 39L, 40L, 41L, 42L, 43L, 44L, 45L, 46L, 47L,  48L, 49L, 50L, 51L, 52L, 53L, 54L, 55L, 56L, 57L, 132L, 133L,  134L, 135L, 136L, 137L, 138L, 139L, 140L, 141L, 142L, 143L, 144L,  145L, 146L, 147L, 148L, 149L, 150L, 151L, 152L, 153L, 154L, 155L,  156L, 157L, 158L, 159L, 160L, 161L, 162L, 163L, 219L, 220L, 221L,  222L, 223L, 224L, 225L, 226L, 227L, 228L, 229L, 230L, 231L, 232L,  233L, 234L, 235L, 236L, 237L, 238L, 239L, 240L, 241L, 242L, 243L,  244L, 245L, 246L, 247L, 248L, 249L), class = "data.frame")

I want the output to be a data frame of the calculated slope next to the start time of each sequence (the timestamp in the first row of each ID) next to the calculated slope from the regression like below

time          slope    measurement_id
2022-09-18    8.67     1
2022-09-18    5.34     2
...           ...

Any help would be greatly appreciated :)

Quinten · Accepted Answer

You could group_by "measurement_id" and add a column with is the row id per group using row_number and summarise the slope which is in the coefficients of lm like this:

library(dplyr)
df %>% 
  group_by(measurement_id) %>% 
  mutate(rowid = row_number()) %>% 
  summarise(time = first(time), 
            slope = lm(ch4_ppm ~ rowid)$coefficients['rowid'])
#> # A tibble: 3 × 3
#>   measurement_id time                slope
#>                            
#> 1              1 2022-07-26 09:11:40  51.8
#> 2              2 2022-07-26 09:20:55  53.7
#> 3              3 2022-07-26 09:28:10  42.3

^{Created on 2022-10-12 with reprex v2.0.2}

Perform linear regression on rows of data based on ID in R

Answers (2)

Related Questions