Reputation: 1677
have a simple dataframe listing types of products and tag number per a couple of months:
data.frame(
stringsAsFactors = FALSE,
month = c("jan","jan","jan","jan",
"jan","feb","feb","feb","feb"),
category = c("TB", "GT", "TB", "YT", "GT", "TB", "GT", "TB", "YT"),
tag_number = c(101L, 101L, 223L, 223L, 223L, 345L, 345L, 655L, 223L)
)
month category tag_number
jan TB 101
jan GT 101
jan TB 223
jan YT 223
jan GT 223
feb TB 345
feb GT 345
feb TB 655
feb YT 223
I want to compare and extract the percentual difference between the unique tag_number values by month and category.
Let me explain a bit more as this seems a complicated issue.
If we get grouped month and categories we get to compare this table (month+category)
Jan TB 101, 223 vs. Feb TB 345, 655
Jan GT 101, 223 vs. Feb GT 345
Jan YT 223 vs. Feb YT 223
In January the TB category has two unique tag numbers (101 and 223). If you compare with February there are two unique tags too, but none of them are equal so the percent difference between months is 100% and the count distinct is two in both cases.
Same for GT category. All tags are different. So again 100%.
Different case is YT. Both months contained the same tag number so the difference is 0%
Here expected result. The percent diff is the percentage of cases distinct between months.
Let's take TB category.
4 unique values total, 2 unique per month, none of them equal so:
4 / 4 = 1 (so 100%)
category pct_diff
TB 100%
GT 100%
YT 0%
Upvotes: 3
Views: 436
Reputation: 79228
df %>%
group_by(category) %>%
summarise(perc_diff = 100 * mean(table(tag_number) == 1))
# A tibble: 3 x 2
category perc_diff
<chr> <dbl>
1 GT 100
2 TB 100
3 YT 0
Upvotes: 2
Reputation: 191
Super interesting puzzle. First thing I did was to expand the test data a little and use actual dates rather than character strings so I could utilize the natural ordering. I added a few more test cases just for laughs.
library(dplyr)
library(purrr)
library(tidyr)
df <- data.frame(
stringsAsFactors = FALSE,
month = c(as.Date("2019-01-01"),as.Date("2019-01-01"),as.Date("2019-01-01"),as.Date("2019-01-01"),
as.Date("2019-01-01"),as.Date("2019-01-01"),as.Date("2019-01-01"),as.Date("2019-02-01"),
as.Date("2019-02-01"),as.Date("2019-02-01"),as.Date("2019-02-01"),as.Date("2019-02-01"),
as.Date("2019-02-01"),as.Date("2019-03-01"),as.Date("2019-03-01"),as.Date("2019-03-01"),
as.Date("2019-03-01"),as.Date("2019-04-01"),as.Date("2019-04-01"),as.Date("2019-04-01"),
as.Date("2019-04-01"),as.Date("2019-04-01")),
category = c("TB", "GT", "TB", "YT", "GT", "TB", "GT", "ZW", "YT", "TB", "GT", "TB", "YT", "GT",
"TB", "GT", "TB", "YT", "TB", "YT", "TB", "GT"),
tag_number = c(101L, 101L, 101L, 101L, 101L, 101L, 101L, 101L, 655L, 345L, 345L, 345L, 345L,
223L, 655L, 345L, 223L, 345L, 655L, 101L, 379L, 345L))
Then I considered your question from the set theory perspective. The question you are asking is essentially "What is the percentage new tags occurring this month (setdiff) of the total tags over this and the preceding month (union)?" Here is a little function to answer that question along with minimal error handling for the first month.
g <- function(t)
{
r <- c()
for (i in seq_along(t))
{
if (i == 1 & length(t[[i]] > 0)) { r <- c(1) }
else if (i == 1 & length(t[[i]]) == 0) { r <- c(0) }
else { r <- c(r, length(Reduce(setdiff, list(t[[i]], t[[i-1]])))
/ length(union(t[[i-1]], t[[i]]))) }
}
return(r)
}
Then, I used pivot_wider to put the data into a form with the date as a unique identifier and each category in a column, and finally applied the function to each of the columns. Et voilà.
s <- df %>% group_by(month) %>%
unique() %>%
pivot_wider(id_cols = month,
names_from = category,
values_from = tag_number,
values_fn = list) %>%
map_if(is.list, ~ g(.x))
List of 5
$ month: Date[1:4], format: "2019-01-01" "2019-02-01" ...
$ TB : num [1:4] 1 0.5 0.667 0.333
$ GT : num [1:4] 1 0.5 0.5 0
$ YT : num [1:4] 1 0.667 0 1
$ ZW : num [1:4] 0 1 0 NaN
Upvotes: 0
Reputation: 25323
A possible solution, based on dplyr
:
library(dplyr)
df <- data.frame(
stringsAsFactors = FALSE,
month = c("jan","jan","jan","jan",
"jan","feb","feb","feb","feb"),
category = c("TB", "GT", "TB", "YT", "GT", "TB", "GT", "TB", "YT"),
tag_number = c(101L, 101L, 223L, 223L, 223L, 345L, 345L, 655L, 223L)
)
df %>%
group_by(category) %>%
summarise(
pct_diff = 100 * (1 - 2*length(
intersect(tag_number[month == "jan"], tag_number[month == "feb"]))
/ length(tag_number)), .groups = "drop")
#> # A tibble: 3 × 2
#> category pct_diff
#> <chr> <dbl>
#> 1 GT 100
#> 2 TB 100
#> 3 YT 0
Upvotes: 1
Reputation: 27732
I tried to come up with a solution that will show you the percentage of unie values based on month-pairs that scales to more than two months.
The code can be shortened by combining rows, but I kept it like this so you can easily inspect the results of each step, to help you understand what is actually happening.
library(data.table)
# make mydata a data.table
setDT(mydata)
mydata[, month_f := factor(month, labels = c("jan", "feb"))]
#build list columns 'tags', containing a list of all tag-values for that month + category
tags <- mydata[, .(alltags = as.list(.SD)), by = .(month_f, category)]
# build lookup table with all month >> month combinations
ans <- CJ(month = unique(mydata$month_f),
month2 = unique(mydata$month_f),
category = unique(mydata$category))
# join in the tags for month and month2
ans[tags, tags := i.alltags, on = .(month = month_f, category)]
ans[tags, tags2 := i.alltags, on = .(month2 = month_f, category)]
# find overlapping values between tags and tags2 list columns
ans[, tags_overlap := Map(intersect, tags, tags2)][]
# calculate the percentage of the overlap
myfun <- function(x, y) {
paste0(100 * (1 - (length(x) / length(y))), "%")
}
ans[, unique := Map(myfun, tags_overlap, tags)][]
# cast to wide format
dcast(ans, category + month ~ month2, value.var = "unique")
# category month jan feb
# 1: GT jan 0% 100%
# 2: GT feb 100% 0%
# 3: TB jan 0% 100%
# 4: TB feb 100% 0%
# 5: YT jan 0% 0%
# 6: YT feb 0% 0%
Upvotes: 1
Reputation: 101373
You can try the base R code below
transform(
reshape(
aggregate(
tag_number ~ .,
df,
toString
),
direction = "wide",
idvar = "category",
timevar = "month"
),
pct_diff = mapply(
function(...) {
sprintf("%s%%", mean(table(unlist(c(...))) == 1) * 100)
},
strsplit(tag_number.feb, ", "),
strsplit(tag_number.jan, ", ")
)
)
which gives
category tag_number.feb tag_number.jan pct_diff
1 GT 345 101, 223 100%
3 TB 345, 655 101, 223 100%
5 YT 223 223 0%
Upvotes: 1
Reputation: 603
Try this solution, but I am not sure whether it can be generalized to a larger case. The approach is to create a wide table then calculate difference percentage.
library(dplyr)
library(tidyr)
library(purrr)
df <- data.frame(
stringsAsFactors = FALSE,
month = c("jan","jan","jan","jan",
"jan","feb","feb","feb","feb"),
category = c("TB", "GT", "TB", "YT", "GT", "TB", "GT", "TB", "YT"),
tag_number = c(101L, 101L, 223L, 223L, 223L, 345L, 345L, 655L, 223L)
)
df2 <- df %>%
pivot_wider(names_from = "month", values_from = "tag_number") %>%
mutate(both_months = map2(jan, feb, ~intersect(.x,.y))) %>% #find intersect
mutate(all_lenght = map2(jan,feb, ~union(.x,.y))) %>% #find all length
mutate(percent_diff = map2(both_months, all_lenght,
~(100 - length(.x)*100/length(.y))) %>%
unlist()) # calculate difference percentage
df2
#> # A tibble: 3 x 6
#> category jan feb both_months all_lenght percent_diff
#> <chr> <list> <list> <list> <list> <dbl>
#> 1 TB <int [2]> <int [2]> <int [0]> <int [4]> 100
#> 2 GT <int [2]> <int [1]> <int [0]> <int [3]> 100
#> 3 YT <int [1]> <int [1]> <int [1]> <int [1]> 0
Upvotes: 1