user189035
user189035

Reputation: 5789

Left join on date range by group ID

Consider this token data set:

Data = structure(list(txs = c(-50, -750, -35, -5.96, -61.5, -42.07, 
-142.4, -500, 132, -154.89, -109.51, -2000, -50, -40, -24.98, 
-15.6, -50, -147.72, -20, -6.6, -5, -20, -13.48, -7.25, -54.09, 
-200, -124.11, -30, -50, -30, 400, -10, -0.95, -4.1, -10000, 
30, -1.99, 74.03, -6.95, -2.96, -29, -403.6, -6, -6, 5250, -513.57, 
-300, -10, -500, -20, -6.45, -7.26, -40, -50, -13.14, 321.29, 
-18, 100, -5.5, -25, -59.2, -10.75, -3.2, 270, 65.8, -11.6, -104.78, 
-99.39, 0.1, -50, -80, -50, -371.44, -78, 270, -6.3, 40, -2.5, 
-29.99, -189.48, -400, -0.29, -20, -6.55, -987.37, -1400, -0.49, 
-20, -29.04, -65, -40, -27.5, -17.37, -10, -1092.84, -5.5, -69.93, 
-15.07, -400, -4.8), week = structure(c(1439157600, 1454281200, 
1471212000, 1445205600, 1448233200, 1451862000, 1449442800, 1453676400, 
1460325600, 1460930400, 1445205600, 1454281200, 1460930400, 1444600800, 
1462140000, 1471816800, 1443996000, 1448838000, 1479682800, 1453071600, 
1447023600, 1473631200, 1465768800, 1433109600, 1445205600, 1433714400, 
1466978400, 1441576800, 1459116000, 1451862000, 1436133600, 1440367200, 
1456095600, 1458514800, 1456700400, 1450047600, 1440972000, 1446418800, 
1465164000, 1441576800, 1442181600, 1453071600, 1461535200, 1460930400, 
1438552800, 1464559200, 1447628400, 1434924000, 1437343200, 1436738400, 
1443391200, 1438552800, 1440972000, 1446418800, 1446418800, 1453071600, 
1453071600, 1457305200, 1444600800, 1462140000, 1435528800, 1457305200, 
1437948000, 1440972000, 1437948000, 1433109600, 1461535200, 1453676400, 
1454886000, 1454281200, 1441576800, 1441576800, 1471212000, 1453071600, 
1451862000, 1442786400, 1443391200, 1439762400, 1436133600, 1461535200, 
1442181600, 1468188000, 1442181600, 1453676400, 1466373600, 1443391200, 
1450652400, 1454886000, 1439157600, 1441576800, 1463954400, 1442181600, 
1446418800, 1454886000, 1476050400, 1461535200, 1456700400, 1456700400, 
1435528800, 1456700400), class = c("POSIXct", "POSIXt"), tzone = ""), 
    num_c = c(1219, 1257, 1195, 33, 1105, 1223, 1257, 1317, 486, 
    1227, 477, 1039, 1238, 1008, 1137, 1294, 1070, 596, 1295, 
    1354, 1010, 1294, 1348, 1254, 19, 1185, 24, 1287, 1198, 955, 
    1324, 1293, 1343, 1162, 1272, 972, 972, 179, 1343, 1105, 
    1085, 1020, 947, 1375, 1005, 477, 596, 1198, 928, 1137, 1263, 
    1237, 1054, 1288, 1185, 1115, 1257, 1301, 1294, 1185, 1039, 
    957, 1131, 33, 477, 1258, 477, 1039, 1362, 1246, 596, 1010, 
    972, 1238, 477, 1296, 972, 1148, 1105, 24, 553, 1297, 1288, 
    1223, 789, 1298, 1082, 1353, 1030, 1287, 1203, 1008, 1294, 
    1227, 1298, 1203, 1346, 1010, 19, 1303)), .Names = c("txs", 
"week", "num_c"), row.names = c(NA, -100L), class = "data.frame")

it has three columns:

Now, for every client I would like to use dplyr to expand (join) this data set so that I get one row for each week in the range of dates during which that client has placed orders.

Moreover, the new empty cells (for txs) should be filled with NA's. The values of txs corresponding to those weeks/clients that are present in the original (no expanded) data set should retain their original values;

I tried:

 library(dplyr)
 stretch_Data = Data %>%
                group_by(num_c) %>%
                right_join(seq(min(week), max(week), by = 'week'), by = "week")

But I get;

Error in seq(min(week), max(week), by = "week") : object 'week' not found

which is silly because Data does contain a week column (thank you very much).

What am I doing wrong?

Edit

Thanks to @mt1022 for his solution (comment below). This is quiet clever. But one problem remains: is it possible to include all the colums from Data? --To make my point I've added one: txs-- Then the missing values (those corresponding to weeks with no purchase) should be filled by NA's (like in a joint). The cells corresponding to client * week dates for which we had a purchase should just retain their original values.

In essence doing na.omit() on the new (expanded table) should return the original table; like it would on a joint;

Upvotes: 1

Views: 980

Answers (1)

Geochem B
Geochem B

Reputation: 428

Using the solution from @mt1002 with one simple addition to obtain the txs column. I have obtained a single value by sum(txs), where this could also be a single value of the min(txs) or max(txs) depending on your needs.

Data %>% group_by(num_c) %>% 
  summarise(week = list(seq(min(week), max(week), by = 'week')),
            txs = sum(txs)) %>% 
  unnest(week)

After the clarification, this is the solution I came up with where there are NA values for weeks there were no orders, in addition to the number of orders per user per week. You could also join the list of weeks with orders by num_c using a left-join including the above query to the df.

library(lubridate)
a <- data.frame(week = rep(seq(1,52,1)))
Data %>% 
  group_by(num_c) %>% 
  mutate(week_num = week(week)) %>% 
  group_by(num_c, week_num) %>% 
  summarise(txs = sum(txs),
            number_orders = n()) %>% 
  full_join(a, by = c("week_num"="week")) %>% 
  ungroup() %>% 
  arrange(week_num)

Upvotes: 1

Related Questions