Reputation: 469
I have a data.frame with several columns and want to filter low-frequency data according to the combination of the variables. The example is like having Male/Female in Sex variable and High/Low in Cholesterol variable. Then my data frame would be like:
set.seed(123)
Sex = sample(c('Male','Female'),size = 20,replace = TRUE)
Age = sample(c('Low','High'),size = 20,replace = TRUE)
Index = 1:20
df = data.frame(index = Index,Sex=Sex,Age=Age)
df
index Sex Age
1 1 Male High
2 2 Female High
3 3 Male High
4 4 Female High
5 5 Female High
6 6 Male High
7 7 Female High
8 8 Female High
9 9 Female Low
10 10 Male Low
11 11 Female High
12 12 Male High
13 13 Female High
14 14 Female High
15 15 Male Low
16 16 Female Low
17 17 Male High
18 18 Male Low
19 19 Male Low
20 20 Female Low
Now I want to filter the combination of Sex/Age where the frequency is higher than 3
table(df[,2:3])
Age
Sex High Low
Female 8 3
Male 5 4
Other words, I want to keep the indices for female-high, male-low and male-high.
Notice that 1) my data frame has several variables (not like the example above) and 2) I do not want to use any third R package and 3) I want it to be fast.
Upvotes: 5
Views: 2070
Reputation: 70266
Here's a simple approach in base R:
lvls <- interaction(df$Sex, df$Age)
counts <- table(lvls)
df[lvls %in% names(counts)[counts > 3], ]
# index Sex Age
#1 1 Male High
#2 2 Female High
#3 3 Male High
#4 4 Female High
#5 5 Female High
#6 6 Male High
#7 7 Female High
#8 8 Female High
#10 10 Male Low
#11 11 Female High
#12 12 Male High
#13 13 Female High
#14 14 Female High
#15 15 Male Low
#17 17 Male High
#18 18 Male Low
#19 19 Male Low
If you have a larger number of variables, you can store them in a vector:
vars <- c("Age", "Sex") # add more
lvls <- interaction(df[, vars])
counts <- table(lvls)
df[lvls %in% names(counts)[counts > 3], ]
And here's a second base R approach using ave
:
subset(df, ave(as.integer(factor(Sex)), Sex, Age, FUN = "length") > 3)
Upvotes: 7
Reputation: 10350
A dplyr
answer would be
library(dplyr)
df %>%
group_by(Sex, Age) %>%
filter(n() > 3)
Even though stated in OP this is not a base R solution. Thought it might be useful for future users who don't have such restrictions.
Upvotes: 2
Reputation: 887118
We can do this with data.table
and it should be efficient as well
library(data.table)
setDT(df)[, .SD[.N > 3], .(Sex, Age)]
Or with .I
setDT(df)[df[, .I[.N >3], .(Sex, Age)]$V1]
Upvotes: 5
Reputation: 47310
vars <- c("Sex","Age")
max_freq <- 3
new_df <- merge(df, subset(as.data.frame(table(df[,vars])),Freq>max_freq)[1:2])
new_df
# Sex Age index
# 1 Female High 2
# 2 Female High 7
# 3 Female High 14
# 4 Female High 11
# 5 Female High 5
# 6 Female High 4
# 7 Female High 13
# 8 Female High 8
# 9 Male High 6
# 10 Male High 3
# 11 Male High 1
# 12 Male High 17
# 13 Male High 12
# 14 Male Low 10
# 15 Male Low 15
# 16 Male Low 18
# 17 Male Low 19
Upvotes: 1
Reputation: 8117
OK, here is a Base-R option
set.seed(123)
Sex = sample(c('Male','Female'),size = 20,replace = TRUE)
Age = sample(c('Low','High'),size = 20,replace = TRUE)
Index = 1:20
df = data.frame(index = Index,Sex=Sex,Age=Age)
df
merge(
df
, aggregate(rep(1, nrow(df)), by = df[,c("Sex", "Age")], sum)
, by = c("Sex", "Age")
)
The aggregate function sum
s up all the 1
s for all the combinations.
Upvotes: 4