Fabio
Fabio

Reputation: 47

R: For loop example

I'm quite new to R and I am struggling to implement a simple (or maybe not so simple) for loop. Look at the following table:

library(data.table) 
data <- 
data.table( stateID    = c(1, 1, 1, 1, 1, 1, 2, 2), 
            districtID = c(1, 1, 1, 1, 2, 2, 1, 1), 
            neighbID   = c(1, 1, 2, 2, 1, 1, 1, 1),
            HHID       = c(1, 1, 2, 2, 3, 3, 4, 4),
            answer     = c(1, 1, 0, 0, 2, 2, 1, 1))

I got identifiers for each household (HHID), that will find themselves in a certain neighborhood (neighbID), which will be in a certain district (districtID), which will be in a certain state (stateID). Each household will respond to a question and give answers 0,1 or 2.

This is what I want to do: I want to calculate within each neighborhood the proportion of households that responded either 1 or 2 over the total number of households in that neighborhood. I know this can be done using a for loop and I tried to figure it out for my case, but I'm a bit lost.

Any help is highly appreciated <3

Upvotes: 1

Views: 119

Answers (3)

Arthur Yip
Arthur Yip

Reputation: 6220

To round out the answers, here's a dplyr approach. Wordier than data.table but may be more understandable to beginners.

library(dplyr) 
data <- tibble( stateID    = c(1, 1, 1, 1, 1, 1, 2, 2), 
                    districtID = c(1, 1, 1, 1, 2, 2, 1, 1), 
                    neighbID   = c(1, 1, 2, 2, 1, 1, 1, 1),
                    HHID       = c(1, 1, 2, 2, 3, 3, 4, 4),
                    answer     = c(1, 1, 0, 0, 2, 2, 1, 1))
data %>% 
  group_by(stateID, districtID, neighbID) %>%
  summarize(prop_12 = mean(answer %in% c(1, 2))) # like how Gregor Thomas showed.
#> # A tibble: 4 x 4
#> # Groups:   stateID, districtID [3]
#>   stateID districtID neighbID prop_12
#>     <dbl>      <dbl>    <dbl>   <dbl>
#> 1       1          1        1       1
#> 2       1          1        2       0
#> 3       1          2        1       1
#> 4       2          1        1       1

Created on 2020-10-21 by the reprex package (v0.3.0)

Also equivalent, going step by step:

data %>% 
  group_by(stateID, districtID, neighbID) %>%
  summarize(prop_12 = mean(answer %in% c(1, 2)))

data %>%
  group_by(stateID, districtID, neighbID) %>%
  mutate(answer_in_1or2 = answer %in% c(1, 2)) %>%
  summarize(prop_12 = sum(answer_in_1or2)/sum(n()))

Upvotes: 1

Ricardo Semi&#227;o
Ricardo Semi&#227;o

Reputation: 4456

You can take the unique values that neighbID assumes with unique, and then check which rows correspond to that neighborhood with which, then use that index to sum only the answers from these rows.

for(i in unique(data$neighbID)){
  index = which(data$neighbID==i)
  print(sum(data$answer[index] %in% c(1,2))/length(index))}

Upvotes: 1

Gregor Thomas
Gregor Thomas

Reputation: 145775

Calling your data dd:

dd[, .(prop_12 = mean(answer %in% c(1, 2))), by = .(stateID, districtID, neighbID)]
#    stateID districtID neighbID prop_12
# 1:       1          1        1       1
# 2:       1          1        2       0
# 3:       1          2        1       1
# 4:       2          1        1       1

Explanation: answer %in% c(1, 2) will be TRUE or FALSE based on that condition. When you do math on TRUE/FALSE values they are converted to 1 or 0 automatically. So sum of a TRUE/FALSE will give the count of TRUEs, and mean of TRUE/FALSE will give the proportion that are TRUE.

And data.table lets you do any operation "by group" using the by argument. Any intro to data.table will cover this well.

Upvotes: 2

Related Questions