Reputation: 563
I have two questions: What resources do you recommend reading to improve data manipulation capabilities? I've been working with larger datasets and have been struggling to adapt--I feel like I'm hitting a brick wall and don't know where to look (many online resources get too complicated without building foundation).
For example, I am trying to solve this issue. I have a df with millions of rows and I am trying to simplify it and analyze a trend. I have a dput example. I am trying to isolate each ID and grab the minimum value for a given year. (Some IDs have years not available for others). After simplifying that data, I am trying to add a percent change column. Given this is a 20+ year time series, I am ok with ignoring months at this point, as minimum value for a year compared to a minimum to another year should yield a reasonable percent change.
Thanks!
Input:
structure(list(ID = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L), .Label = c("a", "b"), class = "factor"), Date = structure(c(1L,
2L, 3L, 4L, 5L, 6L, 10L, 12L, 14L, 7L, 8L, 9L, 11L, 13L, 5L,
6L, 10L, 12L, 14L, 7L, 8L, 9L, 11L, 13L, 15L, 16L), .Label = c("2/21/2009",
"2/22/2009", "2/23/2009", "2/24/2009", "2/25/2009", "2/26/2009",
"3/2/2011", "3/3/2011", "3/4/2011", "3/5/2010", "3/5/2011", "3/6/2010",
"3/6/2011", "3/7/2010", "3/7/2011", "3/8/2011"), class = "factor"),
Year = c(2009L, 2009L, 2009L, 2009L, 2009L, 2009L, 2010L,
2010L, 2010L, 2011L, 2011L, 2011L, 2011L, 2011L, 2009L, 2009L,
2010L, 2010L, 2010L, 2011L, 2011L, 2011L, 2011L, 2011L, 2011L,
2011L), Value = c(10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
20, 21, 22, 5, 6, 7, 8, 8, 9, 10, 11, 12, 15, 23, 25, 27)), .Names = c("ID",
"Date", "Year", "Value"), class = "data.frame", row.names = c(NA,
-26L))
Expected output:
structure(list(ID = structure(c(1L, 1L, 1L, 2L, 2L, 2L), .Label = c("a",
"b"), class = "factor"), Date = structure(c(1L, 4L, 5L, 2L, 4L,
3L), .Label = c("2/21/2009", "2/25/2009", "3/2/2011", "3/5/2010",
"3/6/2011"), class = "factor"), Year = c(2009L, 2010L, 2011L,
2009L, 2010L, 2011L), Value = c(10, 16, 5, 6, 8, 10), Percent.Increase = c(NA,
0.6, -0.6875, NA, 0.333333333, 0.25)), .Names = c("ID", "Date",
"Year", "Value", "Percent.Increase"), class = "data.frame", row.names = c(NA,
-6L))
Upvotes: 2
Views: 104
Reputation: 16697
Until HAVING clause is implemented in data.table, this seems to be pretty efficient way:
dt[dt[, .I[which.min(Value)],, .(ID, Year)]$V1
][, Percent_Increase := {
tmp <- shift(Value)
(Value-tmp)/tmp
}, .(ID)]
Check timing on 5e7.
library(dplyr)
library(data.table)
N = 5e7
set.seed(1)
df = data.frame(ID = sample(2L, N, TRUE),
Date = sample(16L, N, TRUE),
Year = sample(2009:2011, N, TRUE),
Value = sample(N/10, N, TRUE))
dt = as.data.table(df)
system.time(
res <- df %>%
group_by(ID, Year) %>%
slice(which.min(Value)) %>%
group_by(ID) %>%
mutate(Percent_Increase = (Value-lag(Value))/lag(Value))
)
# user system elapsed
# 1.676 2.176 3.847
system.time(
r <- dt[dt[, .I[which.min(Value)],, .(ID, Year)]$V1,
][, Percent_Increase := {
tmp <- shift(Value)
(Value-tmp)/tmp
}, .(ID)]
)
# user system elapsed
# 0.940 0.460 1.334
all.equal(r, as.data.table(res), ignore.col.order = TRUE, check.attributes = FALSE, ignore.row.order = TRUE)
#[1] TRUE
Upvotes: 2
Reputation: 887153
After grouping by 'ID', 'Year', we slice
the min
"Value" rows within each group, then grouped by 'ID', we create the 'Percent.Increase' by subtracting the 'Value' from lag
of 'Value' and dividing by the lag
of 'Value'.
res <- df1 %>%
group_by(ID, Year) %>%
slice(which.min(Value)) %>%
group_by(ID) %>%
mutate(Percent.Increase = (Value-lag(Value))/lag(Value))
Upvotes: 3