Reputation: 1597
I have 2 datasets that I want to use to do a regression on. prep_data is a list of items each with a value for 24 hours:
device_id h city mean_t
1 2C7970 00 Auckland 18.60507
2 2C7970 01 Auckland 18.56266
3 2C7970 02 Auckland 18.46109
4 2C7970 03 Auckland 18.38914
5 2C7970 04 Auckland 18.29095
6 2C7970 05 Auckland 18.28618
7 2C7970 06 Auckland 18.28742
8 2C7970 07 Auckland 18.28725
9 2C7970 08 Auckland 18.34685
10 2C7970 09 Auckland 18.52965
11 2C7970 10 Auckland 18.78257
12 2C7970 11 Auckland 19.05493
13 2C7970 12 Auckland 19.13627
14 2C7970 13 Auckland 19.30245
15 2C7970 14 Auckland 19.49134
16 2C7970 15 Auckland 19.49379
17 2C7970 16 Auckland 19.42745
18 2C7970 17 Auckland 19.30474
19 2C7970 18 Auckland 19.25752
20 2C7970 19 Auckland 19.15033
21 2C7970 20 Auckland 19.02484
22 2C7970 21 Auckland 18.88421
23 2C7970 22 Auckland 18.79243
24 2C7970 23 Auckland 18.70263
25 2C797D 00 Auckland 16.35536
26 2C797D 01 Auckland 16.13241
27 2C797D 02 Auckland 15.91293
28 2C797D 03 Auckland 15.64948
29 2C797D 04 Auckland 15.34483
30 2C797D 05 Auckland 15.19293
31 2C797D 06 Auckland 15.16948
32 2C797D 07 Auckland 15.23397
33 2C797D 08 Auckland 15.69017
34 2C797D 09 Auckland 16.18345
35 2C797D 10 Auckland 16.60586
36 2C797D 11 Auckland 16.80259
37 2C797D 12 Auckland 16.92842
38 2C797D 13 Auckland 17.31712
39 2C797D 14 Auckland 17.57376
40 2C797D 15 Auckland 17.63069
41 2C797D 16 Auckland 17.72552
42 2C797D 17 Auckland 17.69621
43 2C797D 18 Auckland 17.56276
44 2C797D 19 Auckland 17.45534
45 2C797D 20 Auckland 17.29776
46 2C797D 21 Auckland 17.07296
47 2C797D 22 Auckland 16.83681
48 2C797D 23 Auckland 16.65069
49 2C7985 00 Auckland 17.29410
50 2C7985 01 Auckland 16.88867
51 2C7985 02 Auckland 16.71007
52 2C7985 03 Auckland 16.54133
53 2C7985 04 Auckland 16.27081
54 2C7985 05 Auckland 16.02517
55 2C7985 06 Auckland 15.98891
56 2C7985 07 Auckland 15.93858
57 2C7985 08 Auckland 15.87868
58 2C7985 09 Auckland 15.91705
59 2C7985 10 Auckland 16.57119
60 2C7985 11 Auckland 17.14488
61 2C7985 12 Auckland 17.53762
62 2C7985 13 Auckland 18.23825
63 2C7985 14 Auckland 18.69074
64 2C7985 15 Auckland 18.90296
65 2C7985 16 Auckland 19.11993
66 2C7985 17 Auckland 19.10441
67 2C7985 18 Auckland 18.76062
68 2C7985 19 Auckland 18.52761
69 2C7985 20 Auckland 18.33885
70 2C7985 21 Auckland 17.98059
71 2C7985 22 Auckland 17.66859
72 2C7985 23 Auckland 17.58604
73 2C7A55 00 Auckland 17.82138
74 2C7A55 01 Auckland 17.67769
75 2C7A55 02 Auckland 17.43013
76 2C7A55 03 Auckland 17.28240
77 2C7A55 04 Auckland 17.13849
78 2C7A55 05 Auckland 16.93355
79 2C7A55 06 Auckland 16.71875
80 2C7A55 07 Auckland 16.63651
81 2C7A55 08 Auckland 16.58069
82 2C7A55 09 Auckland 16.60646
83 2C7A55 10 Auckland 16.70066
84 2C7A55 11 Auckland 16.82496
85 2C7A55 12 Auckland 17.16405
86 2C7A55 13 Auckland 17.52680
87 2C7A55 14 Auckland 17.78660
88 2C7A55 15 Auckland 18.03415
89 2C7A55 16 Auckland 18.26683
90 2C7A55 17 Auckland 18.39262
91 2C7A55 18 Auckland 18.41612
92 2C7A55 19 Auckland 18.39901
93 2C7A55 20 Auckland 18.35181
94 2C7A55 21 Auckland 18.25581
95 2C7A55 22 Auckland 18.11842
96 2C7A55 23 Auckland 17.94852
And the table to bind looks like this:
city_name h outdoor_hum outdoor_temp
1 Auckland 00 87.00000 11.62608
2 Auckland 01 86.87500 11.52487
3 Auckland 02 87.10596 11.34609
4 Auckland 03 87.82895 11.29961
5 Auckland 04 88.20395 11.15151
6 Auckland 05 88.07237 11.06513
7 Auckland 06 88.80921 10.96934
8 Auckland 07 88.46053 10.96316
9 Auckland 08 87.50658 11.59500
10 Auckland 09 85.03947 12.58395
11 Auckland 10 79.73684 13.80546
12 Auckland 11 75.23684 14.79230
13 Auckland 12 73.01974 15.25612
14 Auckland 13 71.24837 15.50320
15 Auckland 14 69.88889 15.62948
16 Auckland 15 70.81699 15.51739
17 Auckland 16 71.79739 15.14314
18 Auckland 17 74.93464 14.30928
19 Auckland 18 79.57516 13.50771
20 Auckland 19 81.19608 13.03699
21 Auckland 20 82.69281 12.56595
22 Auckland 21 84.43791 12.28150
23 Auckland 22 85.07843 12.05542
24 Auckland 23 85.75163 11.82458
25 Wellington 00 87.00000 11.62608
26 Wellington 01 86.87500 11.52487
27 Wellington 02 87.10596 11.34609
.. Wellington 03 87.82895 11.29961
.. Wellington 04 88.20395 11.15151
.. Wellington 05 88.07237 11.06513
.. Wellington 06 88.80921 10.96934
.. Wellington 07 88.46053 10.96316
.. Wellington 08 87.50658 11.59500
.. Wellington 09 85.03947 12.58395
.. Wellington 10 79.73684 13.80546
.. Wellington 11 75.23684 14.79230
.. Wellington 12 73.01974 15.25612
.. Wellington 13 71.24837 15.50320
.. Wellington 14 69.88889 15.62948
.. Wellington 15 70.81699 15.51739
.. Wellington 16 71.79739 15.14314
.. Wellington 17 74.93464 14.30928
.. Wellington 18 79.57516 13.50771
.. Wellington 19 81.19608 13.03699
.. Wellington 20 82.69281 12.56595
.. Wellington 21 84.43791 12.28150
.. Wellington 22 85.07843 12.05542
.. Wellington 23 85.75163 11.82458
For each unique device_id in the top data.table, I need to do a regression of the mean_t against the outdoor_temp for the matching city as given in the second data.table. Crucially, some of the items need to use the Wellington weather set, and some the Auckland set. ACtually in the full set there cold be dozens of locations. My thinking is that if I can use rbind to output a format something like:
h 2C7985 2C7985_outdoor 2C797D 2C797D_outdoor ... 0 1 2 ... 23
where the _outdoor is the data from the matching city. Then I can do a regression test by searching for each heading that doesn't contain _outdoor and compare it to the corresponding column that does... Here's the code for the case where there is only one city:
table_4 <- c()
# for each column (except h and "Auckland")
for (i in names(reg_data)[!(names(reg_data) %in% c("h", "Auckland"))]) {
# make a line of linear model comparing the column to the weather.
model <- lm(reg_data[[i]] ~ Auckland, data = reg_data)
# bind this new chunk of model action to the table thingy.
table_4 <- rbind(table_4_1_1, dust(model) %>% as.tibble())
}
It's a tricky one, but I reckon there's an R guru out there willing to help this poor old embedded C coder out!
> dput(reg_data[1:48, ])
structure(list(device_id = c("2C7970", "2C7970", "2C7970", "2C7970",
"2C7970", "2C7970", "2C7970", "2C7970", "2C7970", "2C7970", "2C7970",
"2C7970", "2C7970", "2C7970", "2C7970", "2C7970", "2C7970", "2C7970",
"2C7970", "2C7970", "2C7970", "2C7970", "2C7970", "2C7970", "2C797D",
"2C797D", "2C797D", "2C797D", "2C797D", "2C797D", "2C797D", "2C797D",
"2C797D", "2C797D", "2C797D", "2C797D", "2C797D", "2C797D", "2C797D",
"2C797D", "2C797D", "2C797D", "2C797D", "2C797D", "2C797D", "2C797D",
"2C797D", "2C797D"), h = c("00", "01", "02", "03", "04", "05",
"06", "07", "08", "09", "10", "11", "12", "13", "14", "15", "16",
"17", "18", "19", "20", "21", "22", "23", "00", "01", "02", "03",
"04", "05", "06", "07", "08", "09", "10", "11", "12", "13", "14",
"15", "16", "17", "18", "19", "20", "21", "22", "23"), city = c("Auckland",
"Auckland", "Auckland", "Auckland", "Auckland", "Auckland", "Auckland",
"Auckland", "Auckland", "Auckland", "Auckland", "Auckland", "Auckland",
"Auckland", "Auckland", "Auckland", "Auckland", "Auckland", "Auckland",
"Auckland", "Auckland", "Auckland", "Auckland", "Auckland", "Auckland",
"Auckland", "Auckland", "Auckland", "Auckland", "Auckland", "Auckland",
"Auckland", "Auckland", "Auckland", "Auckland", "Auckland", "Auckland",
"Auckland", "Auckland", "Auckland", "Auckland", "Auckland", "Auckland",
"Auckland", "Auckland", "Auckland", "Auckland", "Auckland"),
mean_t = c(18.6050736497545, 18.5626644736842, 18.4610927152318,
18.3891447368421, 18.2909539473684, 18.2861842105263, 18.287417218543,
18.2872516556291, 18.3468543046358, 18.5296540362438, 18.7825657894737,
19.0549342105263, 19.1362745098039, 19.3024509803922, 19.491339869281,
19.4937908496732, 19.4274509803922, 19.3047385620915, 19.2575163398693,
19.1503267973856, 19.0248366013072, 18.8842105263158, 18.7924342105263,
18.7026315789474, 16.3553633217993, 16.1324137931034, 15.9129310344828,
15.6494827586207, 15.3448275862069, 15.1929310344828, 15.1694827586207,
15.2339655172414, 15.6901724137931, 16.1834482758621, 16.6058620689655,
16.8025862068966, 16.9284246575342, 17.3171232876712, 17.573756432247,
17.6306896551724, 17.7255172413793, 17.6962068965517, 17.5627586206897,
17.4553448275862, 17.2977586206897, 17.0729636048527, 16.8368055555556,
16.6506944444444)), row.names = c(NA, -48L), groups = structure(list(
device_id = c("2C7970", "2C797D"), .rows = structure(list(
1:24, 25:48), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), row.names = c(NA, -2L), class = c("tbl_df",
"tbl", "data.frame"), .drop = TRUE), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"))
dput(Hourly_Weather)
structure(list(city_name = c("Auckland", "Auckland", "Auckland",
"Auckland", "Auckland", "Auckland", "Auckland", "Auckland", "Auckland",
"Auckland", "Auckland", "Auckland", "Auckland", "Auckland", "Auckland",
"Auckland", "Auckland", "Auckland", "Auckland", "Auckland", "Auckland",
"Auckland", "Auckland", "Auckland"), h = c("00", "01", "02",
"03", "04", "05", "06", "07", "08", "09", "10", "11", "12", "13",
"14", "15", "16", "17", "18", "19", "20", "21", "22", "23"),
outdoor_hum = c(87, 86.875, 87.1059602649007, 87.8289473684211,
88.2039473684211, 88.0723684210526, 88.8092105263158, 88.4605263157895,
87.5065789473684, 85.0394736842105, 79.7368421052632, 75.2368421052632,
73.0197368421053, 71.2483660130719, 69.8888888888889, 70.8169934640523,
71.797385620915, 74.9346405228758, 79.5751633986928, 81.1960784313726,
82.6928104575163, 84.437908496732, 85.078431372549, 85.7516339869281
), outdoor_temp = c(11.6260784313725, 11.5248684210526, 11.3460927152318,
11.2996052631579, 11.1515131578947, 11.0651315789474, 10.9693421052632,
10.9631578947368, 11.595, 12.5839473684211, 13.8054605263158,
14.7923026315789, 15.2561184210526, 15.5032026143791, 15.629477124183,
15.517385620915, 15.143137254902, 14.3092810457516, 13.5077124183007,
13.0369934640523, 12.5659477124183, 12.2815032679739, 12.0554248366013,
11.8245751633987)), row.names = c(NA, -24L), groups = structure(list(
city_name = "Auckland", .rows = structure(list(1:24), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), row.names = c(NA, -1L), class = c("tbl_df",
"tbl", "data.frame"), .drop = TRUE), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"))
Upvotes: 0
Views: 65
Reputation: 389135
A tidyverse
way to do this would be to fit a linear regression model for each device_id
.
library(tidyverse)
library(pixiedust)
inner_join(reg_data, weather, by = c('city' = 'city_name', 'h')) %>%
group_by(device_id) %>%
summarise(model = list(lm(mean_t ~ outdoor_temp)),
data = purrr::map(model, ~dust(.x) %>% as.tibble)) %>%
unnest(data) %>%
select(-model)
# device_id term estimate std.error statistic p.value
# <chr> <chr> <chr> <chr> <chr> <chr>
#1 2C7970 (Intercept) 15.861 0.284 55.831 0
#2 2C7970 outdoor_temp 0.23 0.022 10.526 0
#3 2C797D (Intercept) 11.146 0.85 13.112 0
#4 2C797D outdoor_temp 0.422 0.065 6.449 0
Upvotes: 1
Reputation: 1597
Well. I found a book called R for Data Science online and it answered my question. I built on the suggestion from @Ronak Shah, so thanks very much for that. Hope this is useful for others.
# filter down to the necessary stuff
reg_data <-
prep_data %>%
select(device_id, h, city, mean_t) %>%
filter(as.integer(h) %in% c(20:23, 0:7))
# don't worry about extra day time hours in Hourly Weather NA ignored later.
merged <- merge(reg_data, Hourly_Weather, by.x = c('city', 'h') , by.y = c('city_name', 'h'))
new_table <- c()
# get unique list of device_ids
deviceList <-
prep_data %>%
select(device_id) %>%
unique() %>%
unlist() %>%
unname()
for (dev in deviceList) {
subset <- merged %>%
filter(device_id == dev)
model <- lm(mean_t ~ outdoor_temp, data = subset, na.action=na.exclude)
new_table <- rbind(new_table, dust(model) %>% as.tibble())
}
Upvotes: 0