Reputation: 1332
Dummy data:
set.seed(4)
name <- sample(LETTERS[1:8], 500, replace = TRUE)
id <- round(runif(500, min=1, max=200))
df <- data.frame(name, id)
I want to check, for each of the remaining names, what percentage of the unique ids of B are also present for that name.
The expected output will be something like this:
name count pct_common
<chr> <int> <dbl>
1 A 17 29.3
2 C 18 31.0
3 D 16 27.6
4 E 22 37.9
5 F 14 24.1
6 G 16 27.6
7 H 20 34.5
My approach so far:
the_name <- 'B'
#Selecting the unique name/id combinations for 'B'
df %>%
  filter(name %in% the_name) %>%
  distinct(name, id) -> list_id
#Checking which of these ids are also present for other names, then counting them
df %>%
  filter(id %in% list_id$id) %>%
  filter(!name %in% the_name) %>%
  group_by(name) %>%
  summarise(count = n()) %>%
  mutate(pct_common = count / nrow(list_id) * 100)
It gets the job done, but creating a separate data frame like this doesn't seem very elegant. Also, it takes more time on a comparatively large data frame (millions of observations).
Is there a better way to approach this problem?
Upvotes: 2
Views: 118
Reputation: 887108
We could do this in a single pipe: create a logical vector ('i1') as a column, get the number of distinct elements of 'id' where 'i1' is TRUE as 'n1', then filter in a single step, count, and get the percentage by dividing 'count' by 'n1'.
library(dplyr)
df %>%
  mutate(i1 = name %in% the_name, n1 = n_distinct(id[i1])) %>%
  filter(id %in% id[i1], !i1) %>%
  count(name, n1, name = 'count') %>%
  mutate(pct_common = count / n1 * 100, n1 = NULL)
-output
name count pct_common
1 A 17 29.31034
2 C 18 31.03448
3 D 16 27.58621
4 E 22 37.93103
5 F 14 24.13793
6 G 16 27.58621
7 H 20 34.48276
NOTE: The OP said "It is getting the job done but creating a separate data frame like this doesn't seem very elegant. Also, it is taking more time for a comparatively large data frame". The code above does that in five steps within a single pipe, and it doesn't repeat the same calculation (i.e. name %in% the_name) multiple times.
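To sanity-check the performance claim, a quick comparison of the two-step approach against the single pipe could look like the sketch below, using the microbenchmark package (an assumption of this sketch, not part of the original answer); timings are machine-dependent and the two may be close on data this small.

```r
library(dplyr)
library(microbenchmark)

set.seed(4)
name <- sample(LETTERS[1:8], 500, replace = TRUE)
id <- round(runif(500, min = 1, max = 200))
df <- data.frame(name, id)
the_name <- 'B'

# OP's approach: build an intermediate data frame first
op_approach <- function() {
  list_id <- df %>%
    filter(name %in% the_name) %>%
    distinct(name, id)
  df %>%
    filter(id %in% list_id$id) %>%
    filter(!name %in% the_name) %>%
    group_by(name) %>%
    summarise(count = n()) %>%
    mutate(pct_common = count / nrow(list_id) * 100)
}

# Single pipe: compute name %in% the_name once as 'i1'
single_pipe <- function() {
  df %>%
    mutate(i1 = name %in% the_name, n1 = n_distinct(id[i1])) %>%
    filter(id %in% id[i1], !i1) %>%
    count(name, n1, name = 'count') %>%
    mutate(pct_common = count / n1 * 100, n1 = NULL)
}

microbenchmark(op = op_approach(), pipe = single_pipe(), times = 50)
```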
If the data is really big, the collapse package can be used:
library(collapse)
n1 <- fndistinct(df$id[df$name %in% the_name])
fsubset(df, id %in% id[name %in% the_name] & !name %in% the_name) %>%
  fnobs(g = .$name, drop = FALSE) %>%
  tfm(pct_common = 100 * name / n1) %>%
  frename(name = count) %>%
  tfm(id = NULL)
count pct_common
A 17 29.31034
C 18 31.03448
D 16 27.58621
E 22 37.93103
F 14 24.13793
G 16 27.58621
H 20 34.48276
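If data.table is an option, the same computation can be sketched as follows (an alternative added for comparison, not benchmarked here; it assumes df and the_name as defined in the question):

```r
library(data.table)

set.seed(4)
df <- data.frame(name = sample(LETTERS[1:8], 500, replace = TRUE),
                 id = round(runif(500, min = 1, max = 200)))
the_name <- 'B'

dt <- as.data.table(df)

# Distinct ids occurring for the target name
b_ids <- unique(dt[name == the_name, id])

# Rows of other names whose id also occurs for 'B', counted per name
dt[id %in% b_ids & name != the_name, .(count = .N), by = name][
  , pct_common := 100 * count / length(b_ids)][]
```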
Upvotes: 2
Reputation: 388982
Here is another option -
library(dplyr)
df %>%
  mutate(temp = n_distinct(id[name %in% the_name])) %>%
  filter(id %in% unique(id[name %in% the_name]) & !name %in% the_name) %>%
  group_by(name, temp) %>%
  summarise(count = n(), .groups = 'drop') %>%
  mutate(pct_common = count / temp * 100) %>%
  select(-temp)
# name count pct_common
# <chr> <int> <dbl>
#1 A 17 29.3
#2 C 18 31.0
#3 D 16 27.6
#4 E 22 37.9
#5 F 14 24.1
#6 G 16 27.6
#7 H 20 34.5
Upvotes: 2
Reputation: 8811
Not better, but here is another approach that I thought of; maybe it will help you:
df %>%
  left_join(
    df %>%
      filter(name == "B") %>%
      mutate(B = 1, N = n_distinct(id)) %>%
      select(-name) %>%
      distinct(),
    by = "id"
  ) %>%
  group_by(name) %>%
  summarise(
    count = sum(B, na.rm = TRUE),
    N = mean(N, na.rm = TRUE)
  ) %>%
  ungroup() %>%
  mutate(pct_common = 100 * count / N)
name count N pct_common
<chr> <dbl> <dbl> <dbl>
1 A 17 58 29.3
2 B 65 58 112.
3 C 18 58 31.0
4 D 16 58 27.6
5 E 22 58 37.9
6 F 14 58 24.1
7 G 16 58 27.6
8 H 20 58 34.5
Upvotes: 2