Tom
Tom

Reputation: 2341

Creating a variable conditional on the sum of a variable by group

I have a data.table as follows:

panelID = c(1:50)   
year    = c(2001:2010)
country = c("NLD", "BEL", "GER")
urban   = c("A", "B", "C")
indust  = c("D", "E", "F")
sizes   = c(1, 2, 3, 4, 5)
n <- 2

library(data.table)

set.seed(123)
DT <- data.table(
    panelID = rep(sample(panelID), each = n),
    country = rep(sample(country, length(panelID), replace = T), each = n),
    year    = c(replicate(length(panelID), sample(year, n))),
    some_NA = sample(0:5, 6),                                             
    some_NA_factor = sample(0:5, 6),
    industry       = rep(sample(indust, length(panelID), replace = T), each = n),
    urbanisation   = rep(sample(urban, length(panelID), replace = T), each = n),
    size      = rep(sample(sizes, length(panelID), replace = T), each = n),
    norm      = round(runif(100)/10, 2),
    sales     = round(rnorm(10, 10, 10), 2),
    Happiness = sample(10, 10),
    Sex       = round(rnorm(10, 0.75, 0.3), 2),
    Age       = sample(100, 100),
    Educ      = round(rnorm(10, 0.75, 0.3), 2)
)        
DT [, uniqueID := .I]  # Creates a unique ID     
DT[DT == 0] <- NA 
DT$sales[DT$sales< 0] <- NA 
DT <- as.data.frame(DT)

What I want is the number of panelIDs for which the sum of size is equal to 8. So I thought I would just do:

DT[sum(size)==8, condition:=1, by=panelID]

What am I doing wrong here?

Upvotes: 2

Views: 72

Answers (3)

NelsonGon
NelsonGon

Reputation: 13309

With data.table:

DT[,conditional:=ifelse(sum(size)==8,1,0),by=panelID][]
# To get the lengths of those which are True(1), save the above as res
#nrow(res[res[,conditional==1],"panelID"])

Or simply as @chinsoon12 suggests:

DT[, conditional := +(sum(size)==8), panelID]

Result(head):

 panelID country year some_NA some_NA_factor industry urbanisation size norm sales
1:      31     GER 2010       4              1        F            C    4 0.09  5.63
2:      31     GER 2005       2             NA        F            C    4 0.03 13.31
3:      15     NLD 2005      NA              4        D            C    3 0.05    NA
4:      15     NLD 2008       1              5        D            C    3 0.01 12.12
5:      14     BEL 2003       5              3        E            B    1 0.09 22.37
6:      14     BEL 2002       3              2        E            B    1 0.04 30.38
   Happiness  Sex Age Educ uniqueID conditional
1:         7 0.69  62 0.25        1           1
2:         3 1.00  10 1.31        2           1
3:        10 0.66  59 0.73        3           0
4:         9 0.85  49 0.88        4           0
5:         2 0.34   7 0.90        5           0
6:         5 0.84  61 1.11        6           0

Upvotes: 3

Louis
Louis

Reputation: 3632

You can do it with dplyr

You can achieve what you want by using this code which uses dplyr:

library(dplyr)
DT %>%
  group_by(panelID) %>%
  summarize(sum = sum(size)) %>%
  filter(sum == 8) %>%
  pull(panelID)

#Output
[1] 11 14 15 16 18 27 28 34 38 45

Edit

If you want to show the number of panels, you can change pull(panelID) with count(), or adding lenght() at the end, like this:

library(dplyr)
DT %>%
  group_by(panelID) %>%
  summarize(sum = sum(size)) %>%
  filter(sum == 8) %>%
  pull(panelID) %>%
  length()

#Output
[1] 10

Hope this helps.

Upvotes: 2

Francesco Grossetti
Francesco Grossetti

Reputation: 1595

I just removed as.data.frame(). I used joins to properly align the sum of size by panelID.

What I don't understand is if you want the values of panelID for which you satisfy the condition given by the sum, I guess by panelID. Or if you want just how many panelID (i.e. individuals?) satisfy the condition.

In the former case, here is what you do:

panelID = c(1:50)   
year    = c(2001:2010)
country = c("NLD", "BEL", "GER")
urban   = c("A", "B", "C")
indust  = c("D", "E", "F")
sizes   = c(1, 2, 3, 4, 5)
n <- 2

library(data.table)

set.seed(123)
DT <- data.table(
  panelID = rep(sample(panelID), each = n),
  country = rep(sample(country, length(panelID), replace = T), each = n),
  year    = c(replicate(length(panelID), sample(year, n))),
  some_NA = sample(0:5, 6),                                             
  some_NA_factor = sample(0:5, 6),
  industry       = rep(sample(indust, length(panelID), replace = T), each = n),
  urbanisation   = rep(sample(urban, length(panelID), replace = T), each = n),
  size      = rep(sample(sizes, length(panelID), replace = T), each = n),
  norm      = round(runif(100)/10, 2),
  sales     = round(rnorm(10, 10, 10), 2),
  Happiness = sample(10, 10),
  Sex       = round(rnorm(10, 0.75, 0.3), 2),
  Age       = sample(100, 100),
  Educ      = round(rnorm(10, 0.75, 0.3), 2)
)        
DT [, uniqueID := .I]  # Creates a unique ID     
DT[DT == 0] <- NA 
DT$sales[DT$sales< 0] <- NA 

dt_sum = DT[ , .(size_sum = sum(size) ), by = panelID ]
setkey( dt_sum, panelID )
setkey( DT, panelID )

DT = DT[ dt_sum ]
final = DT[ size_sum == 8, .N, by = panelID ]
> final
    panelID N
 1:       6 2
 2:       8 2
 3:       9 2
 4:      11 2
 5:      18 2
 6:      22 2
 7:      28 2
 8:      30 2
 9:      31 2
10:      38 2

In the latter case, you simply count the number of rows of final:

> nrow( final )
6

Upvotes: 0

Related Questions