Reputation: 197
I have a data frame as below
raw_data <- data.frame(
"id" = c(1, 1, 1, 2, 2),
"salary" = c(10000,15000,20000,40000,50000),
"expenditure" = c(10000,15000,20000,30000,40000))
if a salary greater than 15000 would be flaged as an outlier, and if an expenditure is greater than 10000, it should be flaged as an outlier. But the problem is now, how to count how many times an outlier (both spearately) occured by a specific id. The output should look like the following df
output <- data.frame(
"id"=c(1,1,1,2,2,2,2),
"question_name"=c("expenditure", "salary","expenditure","salary","expenditure","salary","expenditure"),
"values"=c(15000,20000,20000,30000,40000,500000,40000),
"count"=c(1,1,1,1,1,1,1))
Upvotes: 2
Views: 204
Reputation: 1159
You can try the following
raw_data <- data.frame("id" = 1:5,
"salary" = c(10000,15000,20000,40000,50000),
"expenditure" = c(10000,15000,20000,30000,40000))
raw_data$SaleryOutlier <- ifelse(
raw_data$salary > 15000, TRUE, FALSE)
raw_data$ExpenditureOutlier <- ifelse(
raw_data$expenditure > 10000, TRUE, FALSE)
You can then use aggregate
function to summarize the data, e.g. for each id by using FUN=sum
. This should look like
aggregate(raw_data, by=list(id = raw_data$id), FUN=sum)
This works because TRUE=1
.
I hope this helps.
EDIT
Based on your comment, I guess you are looking for
raw_data <- data.frame("id" = c(1, 1, 1, 2, 2),
"salary" = c(10000,15000,20000,40000,50000),
"expenditure" = c(10000,15000,20000,30000,40000))
raw_data$SaleryOutlier <- ifelse(
raw_data$salary > 15000, TRUE, FALSE)
raw_data$ExpenditureOutlier <- ifelse(
raw_data$expenditure > 10000, TRUE, FALSE)
raw_data_aggregate <- aggregate(raw_data, by=list(id = raw_data$id), FUN=sum)
raw_data_aggregate$count <- raw_data_aggregate$SaleryOutlier + raw_data_aggregate$ExpenditureOutlier
EDIT TWO
If you want to aggregate
over two variables, just exchange the above aggregate
with
raw_data_aggregate <- aggregate(
SalaryOutlier + ExpenditureOutlier ~ id + salary + expenditure, raw_data, FUN=sum)
EDIT THREE
Based on the comments below, I created the following code
raw_data <- data.frame(
"id" = c(1, 1, 1, 2, 2),
"salary" = c(10000,15000,20000,40000,50000),
"expenditure" = c(10000,15000,20000,30000,40000))
# Identify salary outliers
raw_data$SalaryOutlier <- ifelse(
raw_data$salary > 15000, TRUE, FALSE)
# Identify expenditure outliers
raw_data$ExpenditureOutlier <- ifelse(
raw_data$expenditure > 10000, TRUE, FALSE)
# Aggregate over id + salay
raw_data_aggregate_salary <- aggregate(
SalaryOutlier ~ id + salary, raw_data, FUN=sum)
# Aggregate over id + expenditure
raw_data_aggregate_expenditure <- aggregate(
ExpenditureOutlier ~ id + expenditure, raw_data, FUN=sum)
# Just some renaming to fit with desired output data frame.
raw_data_aggregate_salary$question_name <- "salary"
raw_data_aggregate_expenditure$question_name <- "expenditure"
colnames(raw_data_aggregate_salary)[2] <- "values"
colnames(raw_data_aggregate_expenditure)[2] <- "values"
colnames(raw_data_aggregate_salary)[3] <- "count"
colnames(raw_data_aggregate_expenditure)[3] <- "count"
# Bind result together into one df.
raw_data_aggregate <- rbind(
raw_data_aggregate_salary, raw_data_aggregate_expenditure)
# Only select entries where we actually have a count.
raw_data_aggregate <- subset(
raw_data_aggregate,
raw_data_aggregate$count > 0)
# Order to fit with desired output
raw_data_aggregate <- raw_data_aggregate[ order(raw_data_aggregate$id), ]
Upvotes: 1
Reputation: 197
Raw data is :
raw_data <- data.frame("id" = c(1, 1, 1, 2, 2),
"salary" = c(10000,15000,20000,40000,50000),
"expenditure" = c(10000,15000,15000,30000,40000))
And the solution is :
raw_data <- raw_data %>% filter(salary>15000 | expenditure>10000)
entry_variables <- raw_data %>%select(id,salary,expenditure) %>%
pivot_longer(cols = -id,
names_to = "Question_name", values_to= "Value",
values_drop_na = TRUE) %>%
count(id, Question_name, Value)
Upvotes: 1
Reputation: 7385
Here's a dplyr
solution:
raw_data %>%
mutate(salary_flag =
ifelse(salary > 15000, 1, 0),
expenditure_flag = ifelse(expenditure > 10000, 1, 0)) %>%
group_by(id) %>%
mutate(total_outlier = sum(salary_flag) + sum(expenditure_flag))
You are flagging for salary
and expenditure
, then grouping by id
and calculating the sum of all salary_flag
and the sum of all expenditure_flag
for each id
.
id salary expenditure salary_flag expenditure_flag total_outlier
<int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 10000 10000 0 0 0
2 2 15000 15000 0 1 1
3 3 20000 20000 1 1 2
4 4 40000 30000 1 1 2
5 5 50000 40000 1 1 2
If you're only concerned with the total outliers, @MartinGal provided a very nice option:
raw_data %>%
group_by(id) %>%
mutate(total_outlier = sum(salary>15000, expenditure>10000))
Gives us:
id salary expenditure total_outlier
<int> <dbl> <dbl> <int>
1 1 10000 10000 0
2 2 15000 15000 1
3 3 20000 20000 2
4 4 40000 30000 2
5 5 50000 40000 2
edit:
This seems to get the end result that you're looking for:
raw_data %>%
group_by(id) %>%
summarise(count = sum(salary>15000, expenditure>10000),
value = min(salary)) %>%
mutate(title = "salary") %>%
select(id, title, value, count)
Which gives you:
id title value count
<int> <chr> <dbl> <int>
1 1 salary 10000 0
2 2 salary 15000 1
3 3 salary 20000 2
4 4 salary 40000 2
5 5 salary 50000 2
Upvotes: 1
Reputation: 329
In a data.table
this would look like
raw_data[, flag0 := (salary > 15000) + (expenditure > 10000)]
raw_data[, flag := sum(flag0), by = "id"]
Here flag0
is the flag by row (which can later be deleted if you like) and flag
would be the final result.
Edit: Seeing your reply to @Matt, you seem to want the total amount by salary and expenditure seperately. You can do something like
raw_data[, flag_salary := as.integer(salary > 15000)]
raw_data[, flag_expenditure := as.integer(expenditure > 10000)]
raw_data[, flag_salary := sum(flag_salary), by = "id"]
raw_data[, flag_expenditure := sum(flag_expenditure), by = "id"]
Upvotes: 0