Reputation: 35
I have a huge dataset (~7 Gb) and I need EFFECTIVELY replace one variable (iser id) by relative frequency (i.e. freq(user_id) / unique(user_id)). Minimal example:
id <- c(1050, 1324, 5, 7, 1050, 7, 8)
table(id)
id
5 7 8 1050 1324
1 2 1 2 1
Then I tried
freq <- ave(id, id, FUN = function(X) length(X) / length(unique(id)))
df <- data.frame(id = id, freq = freq)
Output:
id freq
1 1050 0.4
2 1324 0.2
3 5 0.2
4 7 0.4
5 1050 0.4
6 7 0.4
7 8 0.2
But on my data set this solution has been working for three(!) hours already. Any help is appreciated :)
Upvotes: 2
Views: 173
Reputation: 171
Here is a tidyverse implementation:
library(tidyverse)
id <- c(1050, 1324, 5, 7, 1050, 7, 8)
data_frame( id = id)-> my_df # creating df
my_df%>%
mutate(unique = unique(id) %>% length) %>% # addying column unique, with the number of unique id
group_by(id) %>% # group by id
mutate(
n=n(), # number of observations for the current group
freq = n / unique # n / unique gives the freq
)
if you want to learn more about group_by check this tutorial: https://www.youtube.com/watch?v=70UcgabaB_I&t=14s
Upvotes: 2
Reputation: 18838
An option could be using sqldf
:
library(sqldf)
df <- data.frame(id = c(1050,1324,5,7,1050,7,8),
freq = c(0.4,0.2,0.2,0.4,0.4,0.4,0.2))
sqldf('SELECT id, freq / Count(*) FROM df
GROUP BY id, freq')
Upvotes: 0