Forge
Forge

Reputation: 1677

count distinct and calculate percent differences in two columns ifrom R dataframe

have a simple dataframe listing types of products and tag number per a couple of months:

data.frame(
  stringsAsFactors = FALSE,
                month = c("jan","jan","jan","jan",
                       "jan","feb","feb","feb","feb"),
                category = c("TB", "GT", "TB", "YT", "GT", "TB", "GT", "TB", "YT"),
                tag_number = c(101L, 101L, 223L, 223L, 223L, 345L, 345L, 655L, 223L)
)

month category tag_number

jan      TB     101
jan      GT     101
jan      TB     223
jan      YT     223
jan      GT     223
feb      TB     345
feb      GT     345
feb      TB     655
feb      YT     223

I want to compare and extract the percentual difference between the unique tag_number values by month and category.

Let me explain a bit more as this seems a complicated issue.

If we get grouped month and categories we get to compare this table (month+category)

Jan  TB  101, 223    vs.    Feb  TB  345, 655
Jan  GT  101, 223    vs.    Feb  GT  345
Jan  YT  223         vs.    Feb  YT  223

In January the TB category has two unique tag numbers (101 and 223). If you compare with February there are two unique tags too, but none of them are equal so the percent difference between months is 100% and the count distinct is two in both cases.

Same for GT category. All tags are different. So again 100%.

Different case is YT. Both months contained the same tag number so the difference is 0%

Here expected result. The percent diff is the percentage of cases distinct between months.

Let's take TB category.

4 unique values total, 2 unique per month, none of them equal so:

4 / 4 = 1 (so 100%)

category   pct_diff
TB           100%
GT           100%
YT             0%

Upvotes: 3

Views: 436

Answers (6)

Onyambu
Onyambu

Reputation: 79228

df %>%
   group_by(category) %>%
   summarise(perc_diff = 100 * mean(table(tag_number) == 1))

# A tibble: 3 x 2
  category perc_diff
  <chr>        <dbl>
1 GT             100
2 TB             100
3 YT               0

Upvotes: 2

Jim
Jim

Reputation: 191

Super interesting puzzle. First thing I did was to expand the test data a little and use actual dates rather than character strings so I could utilize the natural ordering. I added a few more test cases just for laughs.

library(dplyr)
library(purrr)
library(tidyr)

df <- data.frame(
  stringsAsFactors = FALSE,
  month = c(as.Date("2019-01-01"),as.Date("2019-01-01"),as.Date("2019-01-01"),as.Date("2019-01-01"),
            as.Date("2019-01-01"),as.Date("2019-01-01"),as.Date("2019-01-01"),as.Date("2019-02-01"),
            as.Date("2019-02-01"),as.Date("2019-02-01"),as.Date("2019-02-01"),as.Date("2019-02-01"),
            as.Date("2019-02-01"),as.Date("2019-03-01"),as.Date("2019-03-01"),as.Date("2019-03-01"),
            as.Date("2019-03-01"),as.Date("2019-04-01"),as.Date("2019-04-01"),as.Date("2019-04-01"),
            as.Date("2019-04-01"),as.Date("2019-04-01")),
  category = c("TB", "GT", "TB", "YT", "GT", "TB", "GT", "ZW", "YT", "TB", "GT", "TB", "YT", "GT", 
               "TB", "GT", "TB", "YT", "TB", "YT", "TB", "GT"),
  tag_number = c(101L, 101L, 101L, 101L, 101L, 101L, 101L, 101L, 655L, 345L, 345L, 345L, 345L, 
                 223L, 655L, 345L, 223L, 345L, 655L, 101L, 379L, 345L))

Then I considered your question from the set theory perspective. The question you are asking is essentially "What is the percentage new tags occurring this month (setdiff) of the total tags over this and the preceding month (union)?" Here is a little function to answer that question along with minimal error handling for the first month.

g <- function(t)
  {
    r <- c()
    for (i in seq_along(t))
      {
        if (i == 1 & length(t[[i]] > 0))             { r <- c(1) } 
          else if (i == 1 & length(t[[i]]) == 0)     { r <- c(0) }
          else { r <- c(r, length(Reduce(setdiff, list(t[[i]], t[[i-1]]))) 
                              / length(union(t[[i-1]], t[[i]]))) }
      }
    return(r)
  }

Then, I used pivot_wider to put the data into a form with the date as a unique identifier and each category in a column, and finally applied the function to each of the columns. Et voilà.

s <- df %>% group_by(month) %>%
            unique() %>%
            pivot_wider(id_cols = month, 
                     names_from = category, 
                    values_from = tag_number, 
                      values_fn = list) %>%
            map_if(is.list, ~ g(.x))

List of 5
 $ month: Date[1:4], format: "2019-01-01" "2019-02-01" ...
 $ TB   : num [1:4] 1 0.5 0.667 0.333
 $ GT   : num [1:4] 1 0.5 0.5 0
 $ YT   : num [1:4] 1 0.667 0 1
 $ ZW   : num [1:4] 0 1 0 NaN

Upvotes: 0

PaulS
PaulS

Reputation: 25323

A possible solution, based on dplyr:

library(dplyr)

df <- data.frame(
  stringsAsFactors = FALSE,
  month = c("jan","jan","jan","jan",
            "jan","feb","feb","feb","feb"),
  category = c("TB", "GT", "TB", "YT", "GT", "TB", "GT", "TB", "YT"),
  tag_number = c(101L, 101L, 223L, 223L, 223L, 345L, 345L, 655L, 223L)
)

df %>% 
  group_by(category) %>% 
  summarise(
    pct_diff = 100 * (1 - 2*length(
    intersect(tag_number[month == "jan"], tag_number[month == "feb"]))
    / length(tag_number)), .groups = "drop")

#> # A tibble: 3 × 2
#>   category pct_diff
#>   <chr>       <dbl>
#> 1 GT            100
#> 2 TB            100
#> 3 YT              0

Upvotes: 1

Wimpel
Wimpel

Reputation: 27732

I tried to come up with a solution that will show you the percentage of unie values based on month-pairs that scales to more than two months.

The code can be shortened by combining rows, but I kept it like this so you can easily inspect the results of each step, to help you understand what is actually happening.

library(data.table)
# make mydata a data.table
setDT(mydata)
mydata[, month_f := factor(month, labels = c("jan", "feb"))]
#build list columns 'tags', containing a list of all tag-values for that month + category
tags <- mydata[, .(alltags = as.list(.SD)), by = .(month_f, category)]
# build lookup table with all month >> month combinations
ans <- CJ(month  = unique(mydata$month_f), 
   month2 = unique(mydata$month_f),
   category = unique(mydata$category))
# join in the tags for month and month2
ans[tags, tags  := i.alltags, on = .(month = month_f, category)]
ans[tags, tags2 := i.alltags, on = .(month2 = month_f, category)]
# find overlapping values between tags and tags2 list columns
ans[, tags_overlap := Map(intersect, tags, tags2)][]
# calculate the percentage of the overlap
myfun <- function(x, y) {
  paste0(100 * (1 - (length(x) / length(y))), "%")
}
ans[, unique := Map(myfun, tags_overlap, tags)][]
# cast to wide format
dcast(ans, category + month ~ month2, value.var = "unique")
#    category month  jan  feb
# 1:       GT   jan   0% 100%
# 2:       GT   feb 100%   0%
# 3:       TB   jan   0% 100%
# 4:       TB   feb 100%   0%
# 5:       YT   jan   0%   0%
# 6:       YT   feb   0%   0%

Upvotes: 1

ThomasIsCoding
ThomasIsCoding

Reputation: 101373

You can try the base R code below

transform(
  reshape(
    aggregate(
      tag_number ~ .,
      df,
      toString
    ),
    direction = "wide",
    idvar = "category",
    timevar = "month"
  ),
  pct_diff = mapply(
    function(...) {
      sprintf("%s%%", mean(table(unlist(c(...))) == 1) * 100)
    },
    strsplit(tag_number.feb, ", "),
    strsplit(tag_number.jan, ", ")
  )
)

which gives

  category tag_number.feb tag_number.jan pct_diff
1       GT            345       101, 223     100%
3       TB       345, 655       101, 223     100%
5       YT            223            223       0%

Upvotes: 1

Pete Kittinun
Pete Kittinun

Reputation: 603

Try this solution, but I am not sure whether it can be generalized to a larger case. The approach is to create a wide table then calculate difference percentage.

library(dplyr)
library(tidyr)
library(purrr)

df <- data.frame(
  stringsAsFactors = FALSE,
  month = c("jan","jan","jan","jan",
            "jan","feb","feb","feb","feb"),
  category = c("TB", "GT", "TB", "YT", "GT", "TB", "GT", "TB", "YT"),
  tag_number = c(101L, 101L, 223L, 223L, 223L, 345L, 345L, 655L, 223L)
 )

df2 <- df %>% 
  pivot_wider(names_from = "month", values_from = "tag_number") %>% 
  mutate(both_months = map2(jan, feb, ~intersect(.x,.y))) %>%  #find intersect
  mutate(all_lenght = map2(jan,feb, ~union(.x,.y))) %>%  #find all length
  mutate(percent_diff = map2(both_months, all_lenght, 
                             ~(100 - length(.x)*100/length(.y))) %>% 
                        unlist()) # calculate difference percentage

df2

#> # A tibble: 3 x 6
#>   category jan       feb       both_months all_lenght percent_diff
#>   <chr>    <list>    <list>    <list>      <list>            <dbl>
#> 1 TB       <int [2]> <int [2]> <int [0]>   <int [4]>           100
#> 2 GT       <int [2]> <int [1]> <int [0]>   <int [3]>           100
#> 3 YT       <int [1]> <int [1]> <int [1]>   <int [1]>             0

Upvotes: 1

Related Questions