Reputation: 2341
I have a data.table
as follows:
panelID = c(1:50)
year = c(2001:2010)
country = c("NLD", "BEL", "GER")
urban = c("A", "B", "C")
indust = c("D", "E", "F")
sizes = c(1, 2, 3, 4, 5)
n <- 2
library(data.table)
set.seed(123)
DT <- data.table(
panelID = rep(sample(panelID), each = n),
country = rep(sample(country, length(panelID), replace = T), each = n),
year = c(replicate(length(panelID), sample(year, n))),
some_NA = sample(0:5, 6),
some_NA_factor = sample(0:5, 6),
industry = rep(sample(indust, length(panelID), replace = T), each = n),
urbanisation = rep(sample(urban, length(panelID), replace = T), each = n),
size = rep(sample(sizes, length(panelID), replace = T), each = n),
norm = round(runif(100)/10, 2),
sales = round(rnorm(10, 10, 10), 2),
Happiness = sample(10, 10),
Sex = round(rnorm(10, 0.75, 0.3), 2),
Age = sample(100, 100),
Educ = round(rnorm(10, 0.75, 0.3), 2)
)
DT [, uniqueID := .I] # Creates a unique ID
DT[DT == 0] <- NA
DT$sales[DT$sales< 0] <- NA
DT <- as.data.frame(DT)
What I want is the number of panelID
s for which the sum of size
is equal to 8. So I thought I would just do:
DT[sum(size)==8, condition:=1, by=panelID]
What am I doing wrong here?
Upvotes: 2
Views: 72
Reputation: 13309
With data.table
:
DT[,conditional:=ifelse(sum(size)==8,1,0),by=panelID][]
# To get the lengths of those which are True(1), save the above as res
#nrow(res[res[,conditional==1],"panelID"])
Or simply as @chinsoon12 suggests:
DT[, conditional := +(sum(size)==8), panelID]
Result(head):
panelID country year some_NA some_NA_factor industry urbanisation size norm sales
1: 31 GER 2010 4 1 F C 4 0.09 5.63
2: 31 GER 2005 2 NA F C 4 0.03 13.31
3: 15 NLD 2005 NA 4 D C 3 0.05 NA
4: 15 NLD 2008 1 5 D C 3 0.01 12.12
5: 14 BEL 2003 5 3 E B 1 0.09 22.37
6: 14 BEL 2002 3 2 E B 1 0.04 30.38
Happiness Sex Age Educ uniqueID conditional
1: 7 0.69 62 0.25 1 1
2: 3 1.00 10 1.31 2 1
3: 10 0.66 59 0.73 3 0
4: 9 0.85 49 0.88 4 0
5: 2 0.34 7 0.90 5 0
6: 5 0.84 61 1.11 6 0
Upvotes: 3
Reputation: 3632
You can achieve what you want by using this code which uses dplyr
:
library(dplyr)
DT %>%
group_by(panelID) %>%
summarize(sum = sum(size)) %>%
filter(sum == 8) %>%
pull(panelID)
#Output
[1] 11 14 15 16 18 27 28 34 38 45
If you want to show the number of panels, you can change pull(panelID)
with count()
, or adding lenght()
at the end, like this:
library(dplyr)
DT %>%
group_by(panelID) %>%
summarize(sum = sum(size)) %>%
filter(sum == 8) %>%
pull(panelID) %>%
length()
#Output
[1] 10
Hope this helps.
Upvotes: 2
Reputation: 1595
I just removed as.data.frame()
. I used joins to properly align the sum of size
by panelID
.
What I don't understand is if you want the values of panelID
for which you satisfy the condition given by the sum, I guess by panelID
. Or if you want just how many panelID
(i.e. individuals?) satisfy the condition.
In the former case, here is what you do:
panelID = c(1:50)
year = c(2001:2010)
country = c("NLD", "BEL", "GER")
urban = c("A", "B", "C")
indust = c("D", "E", "F")
sizes = c(1, 2, 3, 4, 5)
n <- 2
library(data.table)
set.seed(123)
DT <- data.table(
panelID = rep(sample(panelID), each = n),
country = rep(sample(country, length(panelID), replace = T), each = n),
year = c(replicate(length(panelID), sample(year, n))),
some_NA = sample(0:5, 6),
some_NA_factor = sample(0:5, 6),
industry = rep(sample(indust, length(panelID), replace = T), each = n),
urbanisation = rep(sample(urban, length(panelID), replace = T), each = n),
size = rep(sample(sizes, length(panelID), replace = T), each = n),
norm = round(runif(100)/10, 2),
sales = round(rnorm(10, 10, 10), 2),
Happiness = sample(10, 10),
Sex = round(rnorm(10, 0.75, 0.3), 2),
Age = sample(100, 100),
Educ = round(rnorm(10, 0.75, 0.3), 2)
)
DT [, uniqueID := .I] # Creates a unique ID
DT[DT == 0] <- NA
DT$sales[DT$sales< 0] <- NA
dt_sum = DT[ , .(size_sum = sum(size) ), by = panelID ]
setkey( dt_sum, panelID )
setkey( DT, panelID )
DT = DT[ dt_sum ]
final = DT[ size_sum == 8, .N, by = panelID ]
> final
panelID N
1: 6 2
2: 8 2
3: 9 2
4: 11 2
5: 18 2
6: 22 2
7: 28 2
8: 30 2
9: 31 2
10: 38 2
In the latter case, you simply count the number of rows of final
:
> nrow( final )
6
Upvotes: 0