Reputation: 47
I'm quite new to R and I am struggling to implement a simple (or maybe not so simple) for loop. Look at the following table:
library(data.table)
data <-
data.table( stateID = c(1, 1, 1, 1, 1, 1, 2, 2),
districtID = c(1, 1, 1, 1, 2, 2, 1, 1),
neighbID = c(1, 1, 2, 2, 1, 1, 1, 1),
HHID = c(1, 1, 2, 2, 3, 3, 4, 4),
answer = c(1, 1, 0, 0, 2, 2, 1, 1))
I got identifiers for each household (HHID
), that will find themselves in a certain neighborhood (neighbID
), which will be in a certain district (districtID
), which will be in a certain state (stateID
). Each household will respond to a question and give answers 0
,1
or 2
.
This is what I want to do: I want to calculate within each neighborhood the proportion of households that responded either 1
or 2
over the total number of households in that neighborhood. I know this can be done using a for loop and I tried to figure it out for my case, but I'm a bit lost.
Any help is highly appreciated <3
Upvotes: 1
Views: 119
Reputation: 6220
To round out the answers, here's a dplyr approach. Wordier than data.table but may be more understandable to beginners.
library(dplyr)
data <- tibble( stateID = c(1, 1, 1, 1, 1, 1, 2, 2),
districtID = c(1, 1, 1, 1, 2, 2, 1, 1),
neighbID = c(1, 1, 2, 2, 1, 1, 1, 1),
HHID = c(1, 1, 2, 2, 3, 3, 4, 4),
answer = c(1, 1, 0, 0, 2, 2, 1, 1))
data %>%
group_by(stateID, districtID, neighbID) %>%
summarize(prop_12 = mean(answer %in% c(1, 2))) # like how Gregor Thomas showed.
#> # A tibble: 4 x 4
#> # Groups: stateID, districtID [3]
#> stateID districtID neighbID prop_12
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 1 1 1
#> 2 1 1 2 0
#> 3 1 2 1 1
#> 4 2 1 1 1
Created on 2020-10-21 by the reprex package (v0.3.0)
Also equivalent, going step by step:
data %>%
group_by(stateID, districtID, neighbID) %>%
summarize(prop_12 = mean(answer %in% c(1, 2)))
data %>%
group_by(stateID, districtID, neighbID) %>%
mutate(answer_in_1or2 = answer %in% c(1, 2)) %>%
summarize(prop_12 = sum(answer_in_1or2)/sum(n()))
Upvotes: 1
Reputation: 4456
You can take the unique values that neighbID
assumes with unique
, and then check which rows correspond to that neighborhood with which
, then use that index to sum only the answers from these rows.
for(i in unique(data$neighbID)){
index = which(data$neighbID==i)
print(sum(data$answer[index] %in% c(1,2))/length(index))}
Upvotes: 1
Reputation: 145775
Calling your data dd
:
dd[, .(prop_12 = mean(answer %in% c(1, 2))), by = .(stateID, districtID, neighbID)]
# stateID districtID neighbID prop_12
# 1: 1 1 1 1
# 2: 1 1 2 0
# 3: 1 2 1 1
# 4: 2 1 1 1
Explanation: answer %in% c(1, 2)
will be TRUE or FALSE based on that condition. When you do math on TRUE/FALSE values they are converted to 1
or 0
automatically. So sum
of a TRUE/FALSE will give the count of TRUEs, and mean
of TRUE/FALSE will give the proportion that are TRUE.
And data.table
lets you do any operation "by group" using the by
argument. Any intro to data.table
will cover this well.
Upvotes: 2