Reputation: 3
I am new to R and am running into difficulty with more advanced filtering. I have a data frame of 1500 rows, each a person in a household, and need to exclude everyone who belongs to a household in which at least one person is older than 24. For example, in the sample set below I would want to keep only rows 3, 4, and 5.
PersonalID DOB HouseholdID
1 1961-04-15 123
2 2017-01-12 123
3 2000-01-02 122
4 2001-03-05 122
5 1996-08-22 122
Initially I just filtered to get a new data frame with everyone in that age range and then filtered the original data frame again (and again and again and so on...) with each HouseholdID of someone under 25 to check if anyone else with that HouseholdID is over 24.
Whenever I'm doing the same thing over and over it seems like there's probably a way to use a function instead but I'm having a hard time coming up with one that works. This is my current attempt but I know there's plenty wrong with it:
UNDER25df <- filter(df, DOB >= "yyyy-mm-dd")
for (UNDER25df$HouseholdID in df) {
if (all(df$DOB >= "yyyy-mm-dd")) {
view(filter(df, HouseholdID == "$HouseholdID"))
}
}
The error I get is:
unexpected '}' in "}"
but I'm pretty sure that I can nest an if statement in a for loop in R and that I've been careful about the positioning of the brackets so I don't know exactly what it's referring to.
What I'm not sure of is whether I can iterate through a data frame in this way, or whether this even makes sense. I've read that vectorization is generally better for advanced filtering, but after reading the documentation I couldn't see how to apply it to this problem. Does anyone have a suggestion or a direction I should be looking in?
Upvotes: 0
Views: 1349
Reputation: 6956
You do not need a loop for this. Try
library(lubridate)
library(dplyr)
set.seed(1)
df <- tibble(DOB = Sys.Date() - sample(3000:12000, 6),
personalID = 1:6,
HouseholdID = c(1,1,2,2,2,3))
df$DOB
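# grab the HouseholdID of every person who is older than 24 years
oldies <- df[(lubridate::today() - lubridate::ymd(df$DOB)) > years(24),
"HouseholdID", TRUE]
# base R way: older people have an earlier DOB, so compare with "<"
# (the cutoff date is hard-coded; adjust it to the current date)
oldies <- df[as.Date(df$DOB) < as.Date("1993-2-10"),
"HouseholdID", TRUE]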
# household members in a household with someone older than 24
df %>%
filter(HouseholdID %in% oldies)
# household members in a household with no one older than 24
df %>%
filter(!(HouseholdID %in% oldies))
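As a minimal sketch of the same idea applied to the sample data from the question, the snippet below uses base R only. The fixed `ref_date` and the names `cutoff` and `young_households` are assumptions made for illustration, and "older than 24" is read here as "has reached their 25th birthday".

```r
# Sample data from the question, with DOB stored as Date
df <- data.frame(
  PersonalID  = 1:5,
  DOB         = as.Date(c("1961-04-15", "2017-01-12", "2000-01-02",
                          "2001-03-05", "1996-08-22")),
  HouseholdID = c(123L, 123L, 122L, 122L, 122L)
)

# Assumption: a fixed reference date keeps the result reproducible;
# use Sys.Date() instead when working with live data
ref_date <- as.Date("2020-06-01")

# Anyone born on or before this date has reached their 25th birthday
cutoff <- seq(ref_date, by = "-25 years", length.out = 2)[2]

# Households that contain at least one person aged 25 or more
oldies <- unique(df$HouseholdID[df$DOB <= cutoff])

# Keep only the people whose household is not in that set
young_households <- df[!(df$HouseholdID %in% oldies), ]
young_households
```

Computing the cutoff from a date avoids hard-coding a birthday threshold that goes stale as time passes.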
Upvotes: 1
Reputation: 101099
I am not sure whether you want to keep only the households in which every member is 24 years old or younger. If so, you can try the code below
library(lubridate)
dfout <- subset(df, ave(floor(time_length(Sys.Date()-as.Date(DOB),"years"))<=24, HouseholdID, FUN = all))
If you want to use a for loop instead, the following is an example:
dfout <- data.frame()
for (id in unique(df$HouseholdID)) {
subdf <- subset(df,HouseholdID == id)
if (with(subdf, all(floor(time_length(Sys.Date()-as.Date(DOB),"years"))<=24))) {
dfout <- rbind(dfout,subdf)
}
}
Both approaches above give the following result:
> dfout
PersonalID DOB HouseholdID
3 3 2000-01-02 122
4 4 2001-03-05 122
5 5 1996-08-22 122
DATA
df <- structure(list(PersonalID = 1:5, DOB = c("1961-04-15", "2017-01-12",
"2000-01-02", "2001-03-05", "1996-08-22"), HouseholdID = c(123L,
123L, 122L, 122L, 122L)), class = "data.frame", row.names = c(NA,
-5L))
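The age calculation above divides a day count by an average year length. A possible alternative, sketched here under assumptions, is to measure age with a lubridate `interval`, which counts calendar years between the two dates. The helper `age_in_years`, the `Age` column, and the fixed `ref_date` are hypothetical names introduced for this example only.

```r
library(lubridate)

# Same sample data as in the DATA block above (DOB stored as character)
df <- data.frame(
  PersonalID  = 1:5,
  DOB         = c("1961-04-15", "2017-01-12", "2000-01-02",
                  "2001-03-05", "1996-08-22"),
  HouseholdID = c(123L, 123L, 122L, 122L, 122L)
)

# Assumption: a fixed reference date so the example is reproducible
ref_date <- as.Date("2020-06-01")

# Hypothetical helper: age in completed years on a given date
age_in_years <- function(dob, on = ref_date) {
  floor(time_length(interval(as.Date(dob), on), "years"))
}

df$Age <- age_in_years(df$DOB)

# TRUE for every row of a household whose members are all 24 or younger
all_young <- as.logical(ave(df$Age <= 24, df$HouseholdID, FUN = all))
dfout <- df[all_young, ]
dfout
```

Storing the age in its own column also makes it easy to inspect which household member caused a household to be dropped.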
Upvotes: 0
Reputation: 388817
I am not sure whether you want to select households where all the people are above 24 or where at least one person is above 24. In any case, you can use subset with ave:
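# as.Date() lets this work when DOB is stored as character;
# the difference in calendar years only approximates age
subset(df, ave(as.integer(format(Sys.Date(), "%Y")) -
as.integer(format(as.Date(DOB), "%Y")) >= 24, HouseholdID, FUN = any))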
This selects households where at least one person is above 24. If you want to select households where all people are above 24, use all instead of any in the FUN argument.
Similarly, using dplyr, we can use
library(dplyr)
df %>%
group_by(HouseholdID) %>%
filter(any(as.integer(format(Sys.Date(), "%Y")) -
as.integer(format(as.Date(DOB), "%Y")) >= 24))
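For the goal stated in the question (keep only households in which nobody is older than 24), the grouped filter can be turned around with `all`. The following is a sketch under assumptions rather than a definitive solution: `ref_date`, `cutoff`, and `result` are illustrative names, the reference date is fixed for reproducibility, and "older than 24" is taken to mean "has reached their 25th birthday".

```r
library(dplyr)

# Sample data from the question, with DOB stored as Date
df <- data.frame(
  PersonalID  = 1:5,
  DOB         = as.Date(c("1961-04-15", "2017-01-12", "2000-01-02",
                          "2001-03-05", "1996-08-22")),
  HouseholdID = c(123L, 123L, 122L, 122L, 122L)
)

# Assumption: fixed reference date; use Sys.Date() for live data
ref_date <- as.Date("2020-06-01")

# Anyone born after this date has not yet reached their 25th birthday
cutoff <- seq(ref_date, by = "-25 years", length.out = 2)[2]

result <- df %>%
  group_by(HouseholdID) %>%
  filter(all(DOB > cutoff)) %>%   # every member must be under 25
  ungroup()

result
```

Comparing dates directly against a cutoff avoids the rounding that comes from subtracting calendar years.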
Upvotes: 0