Reputation: 197

finding outliers and counting number of occurrence

I have a data frame as below

raw_data <- data.frame(
  "id" = c(1, 1, 1, 2, 2),
  "salary" = c(10000,15000,20000,40000,50000),
  "expenditure" = c(10000,15000,20000,30000,40000))

if a salary greater than 15000 would be flaged as an outlier, and if an expenditure is greater than 10000, it should be flaged as an outlier. But the problem is now, how to count how many times an outlier (both spearately) occured by a specific id. The output should look like the following df

output <- data.frame(
  "id"=c(1,1,1,2,2,2,2),
  "question_name"=c("expenditure", "salary","expenditure","salary","expenditure","salary","expenditure"),
  "values"=c(15000,20000,20000,30000,40000,500000,40000),
  "count"=c(1,1,1,1,1,1,1))

Upvotes: 2

Answers (4)

MacOS

Reputation: 1159

You can try the following

raw_data <- data.frame("id" = 1:5, 
                       "salary" = c(10000,15000,20000,40000,50000), 
                       "expenditure" = c(10000,15000,20000,30000,40000))

raw_data$SaleryOutlier <- ifelse(
    raw_data$salary > 15000, TRUE, FALSE)

raw_data$ExpenditureOutlier <- ifelse(
    raw_data$expenditure > 10000, TRUE, FALSE)

You can then use aggregate function to summarize the data, e.g. for each id by using FUN=sum. This should look like

aggregate(raw_data, by=list(id = raw_data$id), FUN=sum)

This works because TRUE=1.

I hope this helps.

EDIT

Based on your comment, I guess you are looking for

raw_data <- data.frame("id" = c(1, 1, 1, 2, 2), 
                       "salary" = c(10000,15000,20000,40000,50000), 
                       "expenditure" = c(10000,15000,20000,30000,40000))

raw_data$SaleryOutlier <- ifelse(
  raw_data$salary > 15000, TRUE, FALSE)

raw_data$ExpenditureOutlier <- ifelse(
  raw_data$expenditure > 10000, TRUE, FALSE)

raw_data_aggregate <- aggregate(raw_data, by=list(id = raw_data$id), FUN=sum)

raw_data_aggregate$count <- raw_data_aggregate$SaleryOutlier + raw_data_aggregate$ExpenditureOutlier

EDIT TWO

If you want to aggregate over two variables, just exchange the above aggregate with

raw_data_aggregate <- aggregate(
  SalaryOutlier + ExpenditureOutlier ~ id + salary + expenditure, raw_data, FUN=sum)

EDIT THREE

Based on the comments below, I created the following code

raw_data <- data.frame(
  "id" = c(1, 1, 1, 2, 2),
  "salary" = c(10000,15000,20000,40000,50000),
  "expenditure" = c(10000,15000,20000,30000,40000))

# Identify salary outliers
raw_data$SalaryOutlier <- ifelse(
  raw_data$salary > 15000, TRUE, FALSE)

# Identify expenditure outliers
raw_data$ExpenditureOutlier <- ifelse(
  raw_data$expenditure > 10000, TRUE, FALSE)

# Aggregate over id + salay
raw_data_aggregate_salary <- aggregate(
  SalaryOutlier ~ id + salary, raw_data, FUN=sum)

# Aggregate over id + expenditure
raw_data_aggregate_expenditure <- aggregate(
  ExpenditureOutlier ~ id + expenditure, raw_data, FUN=sum)

# Just some renaming to fit with desired output data frame.
raw_data_aggregate_salary$question_name <- "salary"
raw_data_aggregate_expenditure$question_name <- "expenditure"

colnames(raw_data_aggregate_salary)[2] <- "values"
colnames(raw_data_aggregate_expenditure)[2] <- "values"

colnames(raw_data_aggregate_salary)[3] <- "count"
colnames(raw_data_aggregate_expenditure)[3] <- "count"

# Bind result together into one df.
raw_data_aggregate <- rbind(
  raw_data_aggregate_salary, raw_data_aggregate_expenditure)

# Only select entries where we actually have a count.
raw_data_aggregate <- subset(
  raw_data_aggregate,
  raw_data_aggregate$count > 0)

# Order to fit with desired output
raw_data_aggregate <- raw_data_aggregate[ order(raw_data_aggregate$id), ]

Upvotes: 1

MaxMiak

Reputation: 197

Raw data is :

    raw_data <- data.frame("id" = c(1, 1, 1, 2, 2), 
                           "salary" = c(10000,15000,20000,40000,50000), 
                           "expenditure" = c(10000,15000,15000,30000,40000))

And the solution is :

    raw_data <- raw_data %>% filter(salary>15000 | expenditure>10000)
    entry_variables <- raw_data %>%select(id,salary,expenditure) %>% 
      pivot_longer(cols = -id,
                   names_to = "Question_name", values_to= "Value", 
                   values_drop_na = TRUE) %>%
      count(id, Question_name, Value)

Upvotes: 1

Matt

Reputation: 7385

Here's a dplyr solution:

raw_data %>% 
  mutate(salary_flag =
           ifelse(salary > 15000, 1, 0),
         expenditure_flag = ifelse(expenditure > 10000, 1, 0)) %>% 
  group_by(id) %>% 
  mutate(total_outlier = sum(salary_flag) + sum(expenditure_flag))

You are flagging for salary and expenditure, then grouping by id and calculating the sum of all salary_flag and the sum of all expenditure_flag for each id.

 id salary expenditure salary_flag expenditure_flag total_outlier
  <int>  <dbl>       <dbl>       <dbl>            <dbl>         <dbl>
1     1  10000       10000           0                0             0
2     2  15000       15000           0                1             1
3     3  20000       20000           1                1             2
4     4  40000       30000           1                1             2
5     5  50000       40000           1                1             2

If you're only concerned with the total outliers, @MartinGal provided a very nice option:

raw_data %>% 
group_by(id) %>% 
mutate(total_outlier = sum(salary>15000, expenditure>10000))

Gives us:

     id salary expenditure total_outlier
  <int>  <dbl>       <dbl>         <int>
1     1  10000       10000             0
2     2  15000       15000             1
3     3  20000       20000             2
4     4  40000       30000             2
5     5  50000       40000             2

edit:

This seems to get the end result that you're looking for:

raw_data %>% 
  group_by(id) %>% 
  summarise(count = sum(salary>15000, expenditure>10000),
            value = min(salary)) %>% 
  mutate(title = "salary") %>% 
  select(id, title, value, count)

Which gives you:

     id title  value count
  <int> <chr>  <dbl> <int>
1     1 salary 10000     0
2     2 salary 15000     1
3     3 salary 20000     2
4     4 salary 40000     2
5     5 salary 50000     2

Upvotes: 1

NiklasvMoers

Reputation: 329

In a data.table this would look like

raw_data[, flag0 := (salary > 15000) + (expenditure > 10000)]
raw_data[, flag := sum(flag0), by = "id"]

Here flag0 is the flag by row (which can later be deleted if you like) and flag would be the final result.

Edit: Seeing your reply to @Matt, you seem to want the total amount by salary and expenditure seperately. You can do something like

raw_data[, flag_salary := as.integer(salary > 15000)]
raw_data[, flag_expenditure := as.integer(expenditure > 10000)]
raw_data[, flag_salary := sum(flag_salary), by = "id"]
raw_data[, flag_expenditure := sum(flag_expenditure), by = "id"]

Upvotes: 0

finding outliers and counting number of occurrence

Answers (4)

Related Questions