Shibaprasad
Shibaprasad

Reputation: 1332

How do I check the common occurrences of one type of column value category for for other remaining column values in R?

Dummy data:

set.seed(4)
name <- sample(LETTERS[1:8], 500, replace = T)
id <- round(runif(500, min=1, max=200))

df <- data.frame(name, id)

I want to check the % of unique id of B which are there for other remaining name

The expected output will be something like this:

name  count pct_common
  <chr> <int>      <dbl>
1 A        17       29.3
2 C        18       31.0
3 D        16       27.6
4 E        22       37.9
5 F        14       24.1
6 G        16       27.6
7 H        20       34.5

My approach so far:

the_name <- 'B'

#Selecting the unique name, id combination for 'B'

df %>%
  filter(name %in% the_name) %>%
  distinct(name, id)-> list_id

#Checking which of these ids are already there for other names and then count them.

df %>%
  filter( id %in% list_id$id) %>%
  filter(!name %in% the_name) %>%
  group_by(name) %>%
  summarise(count=n()) %>%
  mutate(pct_common= count/nrow(list_id)*100)

It is getting the job done but creating a separate data frame like this doesn't seem very elegant. Also, it is taking more time for a comparatively large data frame (Millions of observations).

Is there a better way to approach this problem?

Upvotes: 2

Views: 118

Answers (3)

akrun
akrun

Reputation: 887108

We could do this in a single pipe. Create a logical vector ('i1') as column, get the number of distinct elements of 'id' based on 'i1' as 'n1', then do the filter in a single step, count and get the percentage by dividing the 'count' with 'n1'

library(dplyr)
df %>%
    mutate(i1 = name %in% the_name, n1 = n_distinct(id[i1])) %>% 
    filter(id %in% id[i1], !i1) %>% 
    count(name, n1, name = 'count') %>%
    mutate(pct_common= count/n1*100, n1 = NULL)

-output

name count pct_common
1    A    17   29.31034
2    C    18   31.03448
3    D    16   27.58621
4    E    22   37.93103
5    F    14   24.13793
6    G    16   27.58621
7    H    20   34.48276

NOTE: The OP asked about It is getting the job done but creating a separate data frame like this doesn't seem very elegant. Also, it is taking more time for a comparatively large data frame The above code does that in 5 steps and doesn't do the same calculation multiple times i.e. name %in% the_name


if the data is really big, can use collapse

library(collapse)
n1 <- fndistinct(df$id[df$name %in% the_name])
ss(df, id %in% id[name %in% the_name] & !name %in% the_name) %>% 
    fnobs(g = .$name, drop = FALSE) %>% 
    tfm(pct_common = 100 *name/n1) %>%
   frename(name = count) %>%
   tfm(id = NULL)
  count pct_common
A    17   29.31034
C    18   31.03448
D    16   27.58621
E    22   37.93103
F    14   24.13793
G    16   27.58621
H    20   34.48276

Upvotes: 2

Ronak Shah
Ronak Shah

Reputation: 388982

Here is another option -

library(dplyr)

df %>%
  mutate(temp = n_distinct(id[name %in% the_name])) %>%
  filter(id %in% unique(id[name %in% the_name]) & !name %in% the_name) %>%
  group_by(name, temp) %>%
  summarise(count = n(), .groups = 'drop') %>%
  mutate(pct_common = count/temp * 100) %>%
  select(-temp)

#  name  count pct_common
#  <chr> <int>      <dbl>
#1 A        17       29.3
#2 C        18       31.0
#3 D        16       27.6
#4 E        22       37.9
#5 F        14       24.1
#6 G        16       27.6
#7 H        20       34.5

Upvotes: 2

Vin&#237;cius F&#233;lix
Vin&#237;cius F&#233;lix

Reputation: 8811

Not better, but here another approach that I thought of, maybe it will help you

df %>% 
  left_join(
    df %>% 
      filter(name == "B") %>% 
      mutate(B = 1,N = n_distinct(id)) %>% 
      select(-name) %>% 
      distinct() 
  ) %>% 
  group_by(name) %>% 
  summarise(
    count = sum(B,na.rm = TRUE),
    N = mean(N,na.rm = TRUE)
    ) %>%
  ungroup() %>% 
  mutate(pct_common= 100*count/N)


  name  count     N pct_common
  <chr> <dbl> <dbl>      <dbl>
1 A        17    58       29.3
2 B        65    58      112. 
3 C        18    58       31.0
4 D        16    58       27.6
5 E        22    58       37.9
6 F        14    58       24.1
7 G        16    58       27.6
8 H        20    58       34.5

Upvotes: 2

Related Questions