MyQ
MyQ

Reputation: 469

The efficient way of filtering low frequency data in data frame in R

I have a data.frame with several columns and want to filter low-frequency data according to the combination of the variables. The example is like having Male/Female in Sex variable and High/Low in Cholesterol variable. Then my data frame would be like:

set.seed(123)
Sex = sample(c('Male','Female'),size = 20,replace = TRUE)
Age = sample(c('Low','High'),size = 20,replace = TRUE)
Index = 1:20
df = data.frame(index = Index,Sex=Sex,Age=Age)
df


  index    Sex  Age
1      1   Male High
2      2 Female High
3      3   Male High
4      4 Female High
5      5 Female High
6      6   Male High
7      7 Female High
8      8 Female High
9      9 Female  Low
10    10   Male  Low
11    11 Female High
12    12   Male High
13    13 Female High
14    14 Female High
15    15   Male  Low
16    16 Female  Low
17    17   Male High
18    18   Male  Low
19    19   Male  Low
20    20 Female  Low

Now I want to filter the combination of Sex/Age where the frequency is higher than 3

table(df[,2:3])
        Age
Sex      High Low
  Female    8   3
  Male      5   4

Other words, I want to keep the indices for female-high, male-low and male-high.

Notice that 1) my data frame has several variables (not like the example above) and 2) I do not want to use any third R package and 3) I want it to be fast.

Upvotes: 5

Views: 2070

Answers (5)

talat
talat

Reputation: 70266

Here's a simple approach in base R:

lvls <- interaction(df$Sex, df$Age)
counts <- table(lvls)
df[lvls %in% names(counts)[counts > 3], ]

#   index    Sex  Age
#1      1   Male High
#2      2 Female High
#3      3   Male High
#4      4 Female High
#5      5 Female High
#6      6   Male High
#7      7 Female High
#8      8 Female High
#10    10   Male  Low
#11    11 Female High
#12    12   Male High
#13    13 Female High
#14    14 Female High
#15    15   Male  Low
#17    17   Male High
#18    18   Male  Low
#19    19   Male  Low

If you have a larger number of variables, you can store them in a vector:

vars <- c("Age", "Sex") # add more
lvls <- interaction(df[, vars])
counts <- table(lvls)
df[lvls %in% names(counts)[counts > 3], ]

And here's a second base R approach using ave:

subset(df, ave(as.integer(factor(Sex)), Sex, Age, FUN = "length") > 3)

Upvotes: 7

loki
loki

Reputation: 10350

A dplyr answer would be

library(dplyr)
df %>% 
  group_by(Sex, Age) %>% 
  filter(n() > 3) 

Even though stated in OP this is not a base R solution. Thought it might be useful for future users who don't have such restrictions.

Upvotes: 2

akrun
akrun

Reputation: 887118

We can do this with data.table and it should be efficient as well

library(data.table)
setDT(df)[, .SD[.N > 3], .(Sex, Age)]

Or with .I

setDT(df)[df[, .I[.N >3], .(Sex, Age)]$V1]

Upvotes: 5

moodymudskipper
moodymudskipper

Reputation: 47310

vars     <- c("Sex","Age")
max_freq <- 3
new_df   <- merge(df, subset(as.data.frame(table(df[,vars])),Freq>max_freq)[1:2])

new_df
#       Sex  Age index
# 1  Female High     2
# 2  Female High     7
# 3  Female High    14
# 4  Female High    11
# 5  Female High     5
# 6  Female High     4
# 7  Female High    13
# 8  Female High     8
# 9    Male High     6
# 10   Male High     3
# 11   Male High     1
# 12   Male High    17
# 13   Male High    12
# 14   Male  Low    10
# 15   Male  Low    15
# 16   Male  Low    18
# 17   Male  Low    19

Upvotes: 1

Georgery
Georgery

Reputation: 8117

OK, here is a Base-R option

set.seed(123)
Sex = sample(c('Male','Female'),size = 20,replace = TRUE)
Age = sample(c('Low','High'),size = 20,replace = TRUE)
Index = 1:20
df = data.frame(index = Index,Sex=Sex,Age=Age)
df

merge(
    df
    , aggregate(rep(1, nrow(df)), by = df[,c("Sex", "Age")], sum)
    , by = c("Sex", "Age")
)

The aggregate function sums up all the 1s for all the combinations.

Upvotes: 4

Related Questions